12 changes: 7 additions & 5 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -26,13 +26,8 @@ profile_default/

__pypackages__/


/.vscode





/data
/new_data
/method_first_draft
@@ -67,3 +62,10 @@ __pypackages__/

test
.DS_Store

# Install script output
selphi_env/
install.log

# Paper directory (not part of software release)
paper/
62 changes: 62 additions & 0 deletions CITATION.cff
@@ -0,0 +1,62 @@
cff-version: 1.2.0
title: "Selphi: Genotype Imputation via Weighted PBWT"
message: "If you use this software, please cite it using the metadata from this file."
type: software
version: 1.5.3
license: LicenseRef-NonCommercial
repository-code: "https://github.com/omicsedge/selphi"
url: "https://github.com/omicsedge/selphi"
date-released: "2024-07-01"
authors:
- family-names: De Marino
given-names: Adriano
- family-names: Mahmoud
given-names: Abdallah Amr
- family-names: Bohn
given-names: Sandra
- family-names: Lerga-Jaso
given-names: Jon
- family-names: Novković
given-names: Biljana
- family-names: Manson
given-names: Charlie
- family-names: Loguercio
given-names: Salvatore
- family-names: Terpolovsky
given-names: Andrew
- family-names: Matushyn
given-names: Mykyta
- family-names: Torkamani
given-names: Ali
- family-names: Yazdi
given-names: Puya G.
preferred-citation:
type: article
title: "Empowering GWAS Discovery through Enhanced Genotype Imputation"
authors:
- family-names: De Marino
given-names: Adriano
- family-names: Mahmoud
given-names: Abdallah Amr
- family-names: Bohn
given-names: Sandra
- family-names: Lerga-Jaso
given-names: Jon
- family-names: Novković
given-names: Biljana
- family-names: Manson
given-names: Charlie
- family-names: Loguercio
given-names: Salvatore
- family-names: Terpolovsky
given-names: Andrew
- family-names: Matushyn
given-names: Mykyta
- family-names: Torkamani
given-names: Ali
- family-names: Yazdi
given-names: Puya G.
doi: "10.1101/2023.12.18.23300143"
url: "https://www.medrxiv.org/content/10.1101/2023.12.18.23300143v2"
year: 2023
journal: "medRxiv"
216 changes: 150 additions & 66 deletions README.md
@@ -1,106 +1,190 @@
# Selphi Imputation
# Selphi Imputation

<img src="https://github.com/selfdecode/rd-imputation-selphi/blob/master/icons/SDBlueIcon.svg" alt="SelfDecode" style="width: 50px; height: auto;"><img src="https://github.com/selfdecode/rd-imputation-selphi/blob/master/icons/OmicsEdge-Logo.png" alt="OmicsEdge" style="width: 290px; height: auto;">
<img src="icons/SDBlueIcon.svg" alt="SelfDecode" style="width: 50px; height: auto;"> <img src="icons/OmicsEdge-Logo.png" alt="OmicsEdge" style="width: 290px; height: auto;">

Selphi is a tool for genotype imputation based on weighted-PBWT (Positional Burrows-Wheeler Transform) algorithm. It provides efficient imputation of missing genotypes in a target sample dataset using a reference panel.
Selphi is a tool for genotype imputation based on a weighted-PBWT (Positional Burrows-Wheeler Transform) algorithm. It provides efficient imputation of missing genotypes in a target sample dataset using a reference panel, processing entire chromosomes in a single pass to avoid edge effects from windowed approaches.

## Selphi Imputation Assumptions
## Quick start

Please take note of the following assumptions for the effective functioning of Selphi Imputation:
A tiny example dataset is included in the [`example/`](example/) directory (chr22, 100 reference samples, 2 target samples). After installing Selphi with any of the methods below, you can run:

1. `Site Compatibility`: The input/unimputed/chip-sites dataset should only consist of sites that are a subset of the sites available in the reference panel dataset.
```bash
selphi \
--target example/data/target.vcf.gz \
--refpanel example/selphi_ref/chr22 \
--map example/data/genetic_map.chr22.map \
--outvcf example/results/imputed \
--cores 4
```

See [`example/README.md`](example/README.md) for full details.

## Architecture

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="img/architecture_dark.svg"/>
<source media="(prefers-color-scheme: light)" srcset="img/architecture_light.svg"/>
<img src="img/architecture_dark.svg" alt="Selphi pipeline architecture" width="800"/>
</picture>
</p>

2. `Chromosome Consistency`: Both the input/unimputed/chip-sites dataset and the reference panel dataset must pertain to the same chromosome. Make sure they align correctly.
## Assumptions

It is essential to adhere to these assumptions to ensure the proper functioning of Selphi Imputation. Failure to meet these requirements may result in undefined behavior during the imputation process. Please verify these conditions before proceeding with the imputation.
1. **Site Compatibility**: The target dataset should only contain sites that are a subset of the reference panel.
2. **Chromosome Consistency**: Both the target and reference panel must pertain to the same chromosome.
3. **Phased Genotypes**: All target genotypes must be phased.
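
The phasing assumption can be checked before a run. The following is a minimal sketch, not part of Selphi: `count_unphased` is a hypothetical helper that reads uncompressed VCF text on standard input and counts genotype fields that use the `/` separator. It assumes `GT` is the first FORMAT subfield, which the VCF specification requires whenever `GT` is present.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Selphi): count unphased genotype
# fields in an uncompressed VCF text stream read from stdin.
count_unphased() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    {
      for (i = 10; i <= NF; i++) {           # sample columns start at field 10
        split($i, parts, ":")                # GT is the first FORMAT subfield
        if (index(parts[1], "/") > 0) n++    # "/" marks an unphased genotype
      }
    }
    END { print n + 0 }                      # print 0 when nothing was counted
  '
}
```

For example, `bcftools view example/data/target.vcf.gz | count_unphased` prints `0` when no genotype uses the `/` separator. Note that an unphased missing call such as `./.` is counted as well.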

## Installation

1. Make sure you have Docker installed on your system
2. Get the prebuilt Docker image: `docker pull ghcr.io/omicsedge/selphi:latest`
3. Run the Selphi container with the desired options to perform genotype imputation
### Option 1: Install script (recommended)

## Usage
The install script builds all dependencies (htslib, bcftools, pbwt, Python packages) into a self-contained directory. It supports Linux (Ubuntu/Debian, CentOS/RHEL/Fedora, Arch) and macOS.

```bash
# Install system prerequisites (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
gcc g++ make autoconf automake git curl pkg-config \
zlib1g-dev libbz2-dev liblzma-dev libzip-dev libcurl4-openssl-dev \
python3 python3-pip python3-venv

# Install system prerequisites (macOS)
xcode-select --install
brew install autoconf automake git curl pkg-config xz libzip python@3.11

# Run the installer
./install.sh # installs to ./selphi_env
./install.sh /opt/selphi # or specify a custom prefix

# After installation
export PATH="/path/to/selphi_env/bin:$PATH"
selphi --help
```

To run Selphi, use the following command:
The installer also supports these options:
- `--skip-xsqueezeit` — skip optional xSqueezeIt (only needed for `.xsi` reference panels)
- `--skip-python` — skip Python venv setup (use system Python instead)
- `--python /path/to/python3` — use a specific Python 3.10+ executable
- `-j N` — set parallel build jobs (default: auto-detect)

### Option 2: Docker

Build the Docker image from the included Dockerfile:

```bash
docker build -t selphi .
docker run selphi --help
```
docker run selphi [options]

### Option 3: Singularity/Apptainer

For HPC environments where Docker is not available, convert the locally built Docker image into a Singularity image:

```bash
singularity build selphi.sif docker-daemon://selphi:latest
singularity run selphi.sif --help
```

Running the container without specifying any command will display the help message, which outlines the available options and their usage.
This requires building the Docker image first (Option 2), then converting it. The `docker-daemon://` source reads the image from the local Docker daemon, so the same conversion also works with Apptainer:

### Options
```bash
apptainer build selphi.sif docker-daemon://selphi:latest
```

## Usage

### 1. Prepare the reference panel

Convert a phased VCF/BCF reference panel to Selphi's internal formats (`.pbwt`, `.samples`, `.sites`, `.srp`):

- `--refpanel REFPANEL` (required): Specifies the location of reference panel files
- `--target TARGET`: Path to the VCF/BCF file containing the target samples
- `--map MAP`: Path to the genetic map file in Plink format
- `--outvcf OUTVCF`: Path to the output file for storing the imputed data in compressed VCF format, the `.vcf.gz` extension will be automatically added
- `--cores CORES`: Number of available cores for parallel processing (default: 1)
- `--prepare_reference`: Convert the reference panel to PBWT and SRP formats
- `--ref_source_vcf REF_SOURCE_VCF`: Location of the VCF/BCF file containing the reference panel
- `--ref_source_xsi REF_SOURCE_XSI`: Location of xsi files containing reference panel, cannot be used in combination with `--ref_source_vcf`
- `--pbwt_path PBWT_PATH`: Path to the PBWT library
- `--tmp_path TMP_PATH`: Location to create a temporary directory
- `--match_length MATCH_LENGTH`: Minimum PBWT match length
- `--est_ne`: Estimated population size (default: 1000000)
- `--no_core_reduction`: Turn off automatic reduction of cores to limit HMM memory usage


### How to Prepare the reference panel
run:
```bash
docker run -v /path/to/data:/data -it selphi \
# Standalone
selphi \
--prepare_reference \
--ref_source_vcf /data/<refpanel-for-imputation.vcf.gz> \
--refpanel /data/<refpanel> \
--cores <n-cores>
```
--ref_source_vcf /path/to/refpanel.vcf.gz \
--refpanel /path/to/output_prefix \
--cores 16

This command will generate 4 files: `refpanel.pbwt`, `refpanel.samples`, `refpanel.sites`, `refpanel.srp`. Creating these files can be time-intensive for large reference panels, so we recommend these files be saved for future use. They can also be created at the time of imputation by including the `--prepare_reference` and `--ref_source_vcf` flags.
Multiple cores will linearly decrease the time to create the `.srp` file, but this process can be memory-intensive, limiting the number of cores that can be used.
# Docker
docker run -v /path/to/data:/data selphi \
--prepare_reference \
--ref_source_vcf /data/refpanel.vcf.gz \
--refpanel /data/output_prefix \
--cores 16
```

### Target samples
This generates 4 files: `output_prefix.pbwt`, `output_prefix.samples`, `output_prefix.sites`, `output_prefix.srp`. These files can be reused across imputation runs. Multiple cores linearly decrease `.srp` creation time but increase memory usage.
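
Before reusing a prepared panel, it can help to confirm that all four files are present under the prefix. The sketch below is an illustration only: `check_refpanel` is a hypothetical helper, and it checks existence, not file integrity.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Selphi): verify that the four files
# produced by --prepare_reference exist for a given prefix.
check_refpanel() {
  local prefix="$1" ext missing=0
  for ext in pbwt samples sites srp; do
    if [ ! -f "${prefix}.${ext}" ]; then
      echo "missing: ${prefix}.${ext}" >&2   # report each absent file
      missing=1
    fi
  done
  return "$missing"                          # 0 if all four files exist
}
```

A typical use would be `check_refpanel /path/to/output_prefix && echo "reference panel ready"`.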

- Only one chromosome per file, and chromosome must match the reference panel
- All genotypes must be phased
- All variants in the target file not be found in the reference panel will be automatically added to the end of the imputation process
### 2. Run imputation

### Selphi imputation command
```bash
docker run -v /path/to/data:/data -it selphi \
--target <input-samples.vcf.gz> \
--refpanel <refpanel> \
--map <genetic-map-in-plink-format.map> \
--outvcf <output-imputed-samples> \
--cores <n-cores>
# Standalone
selphi \
--refpanel /path/to/refpanel_prefix \
--target /path/to/target.vcf.gz \
--map /path/to/genetic_map.chrN.GRCh38.map \
--outvcf /path/to/output \
--cores 16

# Docker
docker run -v /path/to/data:/data selphi \
--refpanel /data/refpanel_prefix \
--target /data/target.vcf.gz \
--map /data/genetic_map.chrN.GRCh38.map \
--outvcf /data/output \
--cores 16
```

## Contributing
### Options

| Option | Description |
|--------|-------------|
| `--refpanel REFPANEL` | Location of reference panel files (required) |
| `--target TARGET` | Path to VCF/BCF containing target samples |
| `--map MAP` | Path to genetic map in plink format |
| `--outvcf OUTVCF` | Output path for imputed VCF (`.vcf.gz` added automatically) |
| `--cores CORES` | Number of cores for parallel processing (default: 1) |
| `--prepare_reference` | Convert reference panel to PBWT and SRP formats |
| `--ref_source_vcf` | VCF/BCF reference panel source (with `--prepare_reference`) |
| `--ref_source_xsi` | XSI reference panel source (with `--prepare_reference`) |
| `--pbwt_path` | Path to pbwt binary (auto-detected by install script) |
| `--tmp_path` | Location for temporary files |
| `--match_length` | Minimum PBWT match length |
| `--est_ne` | Estimated effective population size (default: 1000000) |
| `--no_core_reduction` | Disable automatic core reduction to limit HMM memory |
| `--chunk_size` | Chunk size for reference panel creation |

### Target file requirements

- One chromosome per file, matching the reference panel
- All genotypes must be phased
- Variants not found in the reference panel are automatically appended to the output
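
The one-chromosome requirement can be verified from the VCF body alone. This is a minimal sketch under the assumption that the input is uncompressed VCF text on standard input; `chromosomes_in_vcf` is a hypothetical helper, not a Selphi command.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Selphi): print each distinct CHROM
# value in a VCF text stream, in order of first appearance.
chromosomes_in_vcf() {
  # Skip header lines; print a chromosome name only the first time it is seen.
  awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }'
}
```

If `gzip -dc /path/to/target.vcf.gz | chromosomes_in_vcf` prints more than one line, the file holds several chromosomes and needs to be split before imputation.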

## Documentation

If you encounter any issues or have suggestions for improvements, please feel free to contribute by submitting a pull request or creating an issue in the GitHub repository.
- **[`example/`](example/)** — Tiny example dataset for quick pipeline validation
- **[`docs/SRP_FORMAT.md`](docs/SRP_FORMAT.md)** — Full specification of the `.srp` (Sparse Reference Panel) file format, including archive structure, chunk layout, sparse matrix encoding, and access patterns

### Development
## Genetic maps

* Make a PR with your changes
* Get your PR reviewed and merged
* Switch to master branch
* `poetry run cz bump`
* `git push --follow-tags`
* [Create new release](https://github.com/selfdecode/rd-imputation-selphi/releases).
Selphi requires a genetic map in plink format for recombination rate interpolation. GRCh38 maps are available from the [Eagle repository](https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/).
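
A basic sanity check on a map file can be sketched as follows. It assumes the standard PLINK `.map` layout of four whitespace-separated columns with no header (chromosome, variant ID, genetic position, base-pair position); whether Selphi accepts other layouts is not stated here. `check_plink_map` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Selphi): check that a map file has
# four columns per line and non-decreasing base-pair positions per chromosome.
check_plink_map() {
  awk '
    NF != 4 {
      printf "line %d: expected 4 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    $1 == prev_chr && $4 + 0 < prev_bp {     # position went backwards
      printf "line %d: position %s is out of order\n", NR, $4
      bad = 1
    }
    { prev_chr = $1; prev_bp = $4 + 0 }
    END { exit bad + 0 }                     # non-zero exit if any problem
  ' "$1"
}
```

Running `check_plink_map /path/to/genetic_map.chrN.GRCh38.map` prints one line per problem and returns a non-zero status if any were found.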

## Contributing

If you encounter any issues or have suggestions for improvements, please submit a pull request or create an issue in the GitHub repository.

## Reference

If you use Selphi in your research, please cite:

```
Empowering GWAS Discovery through Enhanced Genotype Imputation
Adriano De Marino, Abdallah Amr Mahmoud, Sandra Bohn, Jon Lerga-Jaso, Biljana Novković, Charlie Manson, Salvatore Loguercio, Andrew Terpolovsky, Mykyta Matushyn, Ali Torkamani, Puya G. Yazdi
Adriano De Marino, Abdallah Amr Mahmoud, Sandra Bohn, Jon Lerga-Jaso,
Biljana Novković, Charlie Manson, Salvatore Loguercio, Andrew Terpolovsky,
Mykyta Matushyn, Ali Torkamani, Puya G. Yazdi
medRxiv 2023.12.18.23300143; doi: https://doi.org/10.1101/2023.12.18.23300143
```
The full project description can be found in the [PrePrint version](https://www.medrxiv.org/content/10.1101/2023.12.18.23300143v2)

# Non-Commercial Use License
### Version 1.0
## Non-Commercial Use License (v1.0)

## NOTICE
This software is provided free of charge for **academic research use only**. Any use by **commercial entities, for-profit organizations, or consultants** is strictly prohibited without prior authorization. For inquiries about commercial licensing, contact **pyazdi@gmail.com**.
This software is provided free of charge for **academic research use only**. Any use by commercial entities, for-profit organizations, or consultants is strictly prohibited without prior authorization. For inquiries about commercial licensing, contact **pyazdi@gmail.com**.
1 change: 1 addition & 0 deletions apps/selphi-imputation/.gitignore
@@ -0,0 +1 @@
resources/home/
9 changes: 9 additions & 0 deletions apps/selphi-imputation/Makefile
@@ -0,0 +1,9 @@
all:
mkdir -p resources/home/dnanexus/selphi/modules
cp ../../selphi.py resources/home/dnanexus/selphi/
cp ../../pyproject.toml resources/home/dnanexus/selphi/
cp ../../requirements.txt resources/home/dnanexus/selphi/
cp ../../modules/*.py resources/home/dnanexus/selphi/modules/

clean:
rm -rf resources/home
31 changes: 0 additions & 31 deletions apps/selphi-imputation/Readme.developer.md

This file was deleted.
