Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 16 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,23 +34,23 @@ uv sync --all-extras

As input, the most recent datasets are pulled from the respected sources as part of the data preparation process. All source data are mapped to the GRCh38 build of the human genome.

* **The sequence of the human genome** is dowloaded from [Ensembl](http://www.ensembl.org/info/data/ftp/index.html) (checking for the most recent version).
* **Genome wide association signals** most recent version of the NHGRI-EBI [GWAS catalog](https://www.ebi.ac.uk/gwas/) (checking the most recent version).
* **Gene annotation** the most recent gene coordinates are downloaded from [GENCODE](http://www.gencodegenes.org/releases/current.html).
* **Canonical transcripts** of protein coding genes are defined according to [Ensembl](https://www.ensembl.org/Help/Glossary).
* **Cytological bands** coordinates fetched from the Ensembl [REST API](http://rest.ensembl.org/).
- **The sequence of the human genome** is dowloaded from [Ensembl](http://www.ensembl.org/info/data/ftp/index.html) (checking for the most recent version).
- **Genome wide association signals** most recent version of the NHGRI-EBI [GWAS catalog](https://www.ebi.ac.uk/gwas/) (checking the most recent version).
- **Gene annotation** the most recent gene coordinates are downloaded from [GENCODE](http://www.gencodegenes.org/releases/current.html).
- **Canonical transcripts** of protein coding genes are defined according to [Ensembl](https://www.ensembl.org/Help/Glossary).
- **Cytological bands** coordinates fetched from the Ensembl [REST API](http://rest.ensembl.org/).

More information on the sources can be found in the `config.json` configuration file.

### Step 1 - Pre-processing

```bash
python Prepare_data.py -d data_folder/ -c config.json -s 450 -t 0.5
uv run Prepare-data -d data_folder/ -c config.json -s 450 -t 0.5
```

help output:

```
```text
usage: prepare_data.py [-h] -d DATADIR -c CONFIG -s CHUNKSIZE -t TOLERANCE

This script fetches and parses input data for the genome plotter project
Expand All @@ -67,20 +67,20 @@ optional arguments:
Fraction of a chunk that cannot be N.
```

* *<DATADIR>* folder into which the files are going to be saved.
* *<CONFIG>* JSON file containing the project level configuration. Will be used for multiple scripts
* *<LOGFILE>* information on the run is saved here.
* *<CHUNKSIZE>* the length of non-overlapping window used to pool together to calculate [GC content](https://en.wikipedia.org/wiki/GC-content). In basepairs.
* *<TOLERANCE>* Ns are discarded from the GC content calculation. This float (ranging from 0-1) shows the maximum of Ns in a chunk tolerated. Chunks with too high N ratio is considered as heterochromatic region on the plot.
- **DATADIR** folder into which the files are going to be saved.
- **CONFIG** JSON file containing the project level configuration. Will be used for multiple scripts
- **LOGFILE** information on the run is saved here.
- **CHUNKSIZE** the length of non-overlapping window used to pool together to calculate [GC content](https://en.wikipedia.org/wiki/GC-content). In basepairs.
- **TOLERANCE** Ns are discarded from the GC content calculation. This float (ranging from 0-1) shows the maximum of Ns in a chunk tolerated. Chunks with too high N ratio is considered as heterochromatic region on the plot.


### Step 2 - Generate chromosome plot

```bash
./plot_chromosome.py --help
uv run plot-chromosome --help
```

```
```text
usage: plot_chromosome.py [-h] -c CHROMOSOME [-w WIDTH] [-p PIXEL] [-s DARKSTART] [-m DARKMAX] -f FOLDER [--textFile]
[-g GENEFILE] [-t TEST] [--dummy] --config CONFIG [-l LOGFILE]

Expand Down Expand Up @@ -125,16 +125,15 @@ Finally the `.png` file is saved (and `.svg` file if required).

The `gene_sets/` folder contains a set of files that can be used as gene annotation:

* *kinases_Hs.bed.gz*: list of kinase genes in the human genome, generated by `get_kinases.sh` script.
* *gene_w_drugs.tsv.gz*: list of genes for which approved drugs exists (where the mechanism of action is known) based on [OpenTargets tractability](https://docs.targetvalidation.org/getting-started/target-tractability) data.
- *kinases_Hs.bed.gz*: list of kinase genes in the human genome, generated by `get_kinases.sh` script.
- *gene_w_drugs.tsv.gz*: list of genes for which approved drugs exists (where the mechanism of action is known) based on [OpenTargets tractability](https://docs.targetvalidation.org/getting-started/target-tractability) data.

### Result

The following image was created based on the data of chromosome 20, where 450 bp-s were averaged to get GC content, and 200 of these chunks were plotted in each row. The list of kinase genes are shown on the right side of the chromosome.

<img src="plots/chr20.png" alt="Chromosome 20" height="1500"/>


The combined plot showing the entire genome with all the protein kinases highlighted:

<img src="plots/kinases.png" alt="Protein kinases in the genome" width="1000">
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@
"gwas_data": {
"host": "ftp.ebi.ac.uk",
"path": "/pub/databases/gwas/releases/latest",
"source_file": "gwas-catalog-associations_ontology-annotated.tsv",
"source_file": "gwas-catalog-associations_ontology-annotated-full.zip",
"processed_file": "processed_GWAS.bed.gz",
"release_date": null
}
Expand Down
11 changes: 8 additions & 3 deletions src/genome_plotter/functions/ChromosomePlotter.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@
from __future__ import annotations

import logging
import os

os.environ["DYLD_FALLBACK_LIBRARY_PATH"] = "/opt/homebrew/lib"

import cairosvg
import pandas as pd
Expand Down Expand Up @@ -62,14 +65,16 @@ def __add_centromere(self: ChromosomePlotter) -> None:

# Right mark of the centromere:
centromere_string = f'<path d="M -1 0 \
C 0 {centromere_midpoint}, {cetromere_x/2} {centromere_midpoint}, {cetromere_x/2} {centromere_midpoint} \
C {cetromere_x/2} {centromere_midpoint}, 0 {centromere_midpoint}, -1 {centromere_hight} Z" fill="white"/>\n'
C 0 {centromere_midpoint}, {cetromere_x / 2} {centromere_midpoint}, {cetromere_x / 2} {centromere_midpoint} \
C {cetromere_x / 2} {centromere_midpoint}, 0 {centromere_midpoint}, -1 {centromere_hight} Z" fill="white"/>\n'

half_centromere = f'<g transform="translate(0, {centromere_start})">\n\t{centromere_string}\n</g>\n'

# Generating the other half of the centromoere:
other_half = f'<g transform="rotate(180 0 {centromere_start + centromere_midpoint}) \
other_half = (
f'<g transform="rotate(180 0 {centromere_start + centromere_midpoint}) \
translate(-{self.__width__}, 0)">\n\t{half_centromere}\n</g>\n'
)

# adding both sides of the centromere to the plot:
self.__plot_string__ += (
Expand Down
4 changes: 4 additions & 0 deletions src/genome_plotter/functions/CytobandAnnotator.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,11 @@

from __future__ import annotations

import os

import cairosvg

os.environ["DYLD_FALLBACK_LIBRARY_PATH"] = "/opt/homebrew/lib"
import pandas as pd


Expand Down
4 changes: 3 additions & 1 deletion src/genome_plotter/functions/svg_handler.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,10 @@

from __future__ import annotations

import os
from typing import Any

os.environ["DYLD_FALLBACK_LIBRARY_PATH"] = "/opt/homebrew/lib"
import cairosvg


Expand Down Expand Up @@ -177,7 +179,7 @@ def draw_line(

if kwargs:
for key, value in kwargs.items():
extra_args += f' {key.replace("_","-")}="{value}"'
extra_args += f' {key.replace("_", "-")}="{value}"'
print(extra_args)
self.__svg__ += self.__svg_line__.format(
x1, y1, x2, y2, stroke, stroke_width, extra_args
Expand Down
18 changes: 10 additions & 8 deletions src/genome_plotter/input_parsers/fetch_ensembl.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,9 @@ def __init__(self: FetchGenome, ensembl_parameters: SourcePrototype) -> None:
Args:
ensembl_parameters (SourcePrototype): Parameters to fetch the data.
"""
assert (
ensembl_parameters.release and ensembl_parameters.path
), "Ensembl release and path needs to be provided."
assert ensembl_parameters.release and ensembl_parameters.path, (
"Ensembl release and path needs to be provided."
)

# These are the parameters required to fetch the data:
self.host = ensembl_parameters.host
Expand Down Expand Up @@ -93,11 +93,7 @@ def parse_genome(
# If there's data in the buffer, save it:
if chrom_data:
# We are skipping non-canonical chromosomes:
if chrom_name and len(chrom_name) < 3:
logger.info(f"Parsing chromosome {chrom_name} is done.")
self.process_chromosome(chrom_data, chrom_name)
else:
logger.info(f"Chromosome {chrom_name} is skipped.")
self.process_chromosome(chrom_data, chrom_name)

# Empty chromosome data buffer:
chrom_data = ""
Expand Down Expand Up @@ -129,6 +125,12 @@ def process_chromosome(
chrom_data (pd.DataFrame): The chromosome sequence data.
chr_name (str): The name of the chromosome.
"""
# Test chromosome name if we want to process or not:
if chr_name is None or len(chr_name) > 3:
logger.warning(f"Non-canoncial chromosome ({chr_name}) is skipped.")
return None

logger.info(f"Processing chromosome: {chr_name}")
raw_data = []
file_name = f"{self.data_folder}/{self.parsed_file.format(chr_name)}"
chunk_size = self.chunk_size
Expand Down
Loading