Quick links: Learn about the PCLAI HPRC v2 BED files: PCLAI on HPRC Release 2 samples | Preprint: bioRxiv
Point cloud local ancestry inference (PCLAI) is a deep learning-based approach for inferring continuous population genetic structure along the genome. Instead of assigning each genomic window to a discrete ancestry label, PCLAI predicts:
- A continuous coordinate (e.g., a point in PC1–PC2 space) for every window, and
- A per-window confidence score.
For technical details, please see the preprint.
We are open to collaborations. If you need additional information to run the current pretrained models, please reach out to geleta@berkeley.edu.
PCLAI runs on Python 3.12.10. The harmonization pipeline uses bcftools 1.13, Java openjdk 21.0.3, and BEAGLE (version 5.5). We recommend setting up a virtual environment and installing the required modules:
```shell
python3 -m pip install -r requirements.txt
```

Using pre-trained PCLAI relies on two main workflows:
- Harmonization of an arbitrary input VCF into the SNP space expected by a pretrained PCLAI bundle
- Inference using a pretrained bundle on the harmonized VCFs
Currently, we provide four pretrained bundles:
- `pclai_1kg_bundle_cpu`: trained on 1000 Genomes for CPU inference (download).
- `pclai_1kg_bundle_cuda`: trained on 1000 Genomes for GPU inference (download).
- `pclai_1kg+hgdp_bundle_cpu`: trained on 1000 Genomes + Human Genome Diversity Project (HGDP) for CPU inference (download).
- `pclai_1kg+hgdp_bundle_cuda`: trained on 1000 Genomes + Human Genome Diversity Project (HGDP) for GPU inference (download).
The CPU and CUDA bundles contain the same model family, but exported for different devices.
Important
Use the CPU bundle if you will run inference on CPU, and the CUDA bundle if you will run inference on GPU. Exported PyTorch programs are device-specialized, so a CPU bundle is not interchangeable with a CUDA bundle.
Tip
Because these bundles have different numbers of input SNPs, it is often useful to try both and compare which one gives better SNP coverage for your input VCF.
Each bundle follows this structure:
```
pclai_bundle/
    manifest.json
    snp_manifests/
        chr1_s01.snps.tsv
        chr1_s02.snps.tsv
        ...
        chr13.snps.tsv
        ...
    models/
        pclainet_chm1_s01.pt2
        pclainet_chm1_s02.pt2
        ...
```
The snp_manifests/ directory contains the exact ordered SNP list expected by each exported model artifact.
Each `.snps.tsv` file has this format:

```
chrom pos rsid ref alt
chr21 5034221 . G A
chr21 5034230 . C T
chr21 5034244 . C T
```
These manifest files are the source of truth for:
- Expected SNP identity
- Expected SNP order
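As a sketch, a SNP manifest can be read with the standard library alone; the helper below is illustrative (not part of the PCLAI API) and assumes the whitespace-separated column layout shown above:

```python
def load_snp_manifest(path):
    """Read a bundle .snps.tsv into an ordered list of (chrom, pos, ref, alt).

    Assumes the whitespace-separated columns shown above
    (chrom, pos, rsid, ref, alt). Illustrative only, not part
    of the PCLAI API.
    """
    snps = []
    with open(path) as fh:
        fh.readline()  # skip the header row
        for line in fh:
            chrom, pos, _rsid, ref, alt = line.split()
            snps.append((chrom, int(pos), ref, alt))
    return snps
```

Preserving the row order matters: the list index is the position each SNP is expected to occupy in the model input.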
The models/ directory contains exported PyTorch model artifacts (.pt2) that can be loaded for inference.
The manifest.json file contains the overall bundle metadata, including:
- Chromosome / subset mapping
- Model artifact paths
- SNP manifest paths
- Expected SNP counts
- Bundle-level configuration metadata
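A minimal sketch for inspecting the bundle metadata. It deliberately assumes nothing about the exact key names, which are defined by each bundle's `manifest.json` itself:

```python
import json

def list_manifest_keys(path):
    """Return the sorted top-level keys of a bundle manifest.json.

    Purely illustrative: the actual schema (key names, nesting)
    is defined by the bundle, so inspect it rather than assuming one.
    """
    with open(path) as fh:
        manifest = json.load(fh)
    return sorted(manifest)
```

Listing the top-level keys is a quick way to locate where the model artifact paths and SNP manifest paths live in your bundle.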
Your input VCF must be harmonized to the model space before running inference. In practice, that means:
- GRCh38 coordinates
- Phased haplotypes
- Imputed against the reference panel (1000 Genomes and/or HGDP)
- Harmonized to the SNP order expected by the bundle
If your VCF is already:
- In GRCh38
- Phased
- Imputed
- And already matches the SNP order expected by the bundle
Then you can skip harmonization and run inference directly.
Otherwise, run pipeline.py first.
Warning
It is very likely you will need to run harmonization. Do not run inference without harmonizing your VCF first.
Caution
Running inference without harmonization will lead to incorrect outputs.
The harmonization pipeline does the following for each chromosome:
- Extracts chromosome-specific variants from the input VCF
- Aligns contig naming with the reference panel
- Checks allele concordance against the bundle SNP manifest
- Optionally normalizes the target VCF against a reference FASTA
- Phases / imputes the target VCF (with BEAGLE5)
- Emits a final per-chromosome VCF in the exact SNP-manifest order expected by the bundle
You will need:
- BCFtools
- java
- BEAGLE5 (JAR file path)
- A GRCh38 reference FASTA if using normalization (optional)
- Reference panel VCFs for imputation (1000 Genomes and/or HGDP)
Important
The reference panel VCFs used for imputation are not the same thing as the bundle SNP manifests:
- The bundle SNP manifests define the final SNP order expected by the model.
- The reference panel VCFs are used by BEAGLE for phasing/imputation.
All pretrained bundles assume GRCh38 coordinates.
Important
If your input VCF is not in GRCh38, harmonization will not be sufficient by itself. You must first liftover or otherwise convert the data to GRCh38.
Note
Make sure your VCF contigs include the chr prefix (e.g., chr1 instead of 1): ##contig=<ID=chr1>.
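If your contigs lack the `chr` prefix, `bcftools annotate --rename-chrs` can add it from a two-column "old new" mapping file. A minimal sketch that writes such a mapping for the autosomes plus X and Y (the file name `chr_map.txt` is our choice):

```python
# Write a two-column "old new" mapping usable with
# `bcftools annotate --rename-chrs`. Covers autosomes plus X and Y.
names = [str(i) for i in range(1, 23)] + ["X", "Y"]
with open("chr_map.txt", "w") as fh:
    for name in names:
        fh.write(f"{name}\tchr{name}\n")
```

Then rename with, e.g., `bcftools annotate --rename-chrs chr_map.txt input.vcf.gz -Oz -o renamed.vcf.gz`.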
```shell
python3 ./pipeline.py \
    --input-vcf "/path/to/input.vcf.gz" \
    --workdir "/path/to/workdir" \
    --bundle-dir "/path/to/pclai_bundle" \
    --reference-split-template "/path/to/reference.chr{chrom}.vcf.gz" \
    --impute-engine beagle \
    --beagle-jar "/path/to/beagle.jar" \
    --threads 16 \
    --reference-fasta "/path/to/GRCh38.fa" \
    --auto-normalize-on-qc-fail \
    --log-level DEBUG
```

- `--bundle-dir` points to one of the pretrained PCLAI bundles
- `--reference-split-template` points to the reference VCFs (1000 Genomes and/or HGDP) used for BEAGLE imputation. The path can follow any pattern as long as it contains a `{chrom}` placeholder, because the code simply calls `.format(chrom=chrom)` on the string.
- `--beagle-jar` must point to a valid BEAGLE JAR
- `--reference-fasta` is used for optional target normalization
- `--auto-normalize-on-qc-fail` retries QC after normalization when needed
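Because the template is expanded with plain `str.format`, any path layout works as long as the `{chrom}` placeholder is present. For example:

```python
# The pipeline expands the template with plain str.format, so any
# layout works as long as it contains the {chrom} placeholder.
template = "/path/to/reference.chr{chrom}.vcf.gz"
paths = [template.format(chrom=c) for c in ["1", "2", "21"]]
print(paths[-1])  # /path/to/reference.chr21.vcf.gz
```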
Important
The harmonization pipeline assumes the input VCF is indexed and bgzip-compressed (.vcf.gz) or otherwise readable by bcftools. If your input is a plain .vcf, compress and index it first.
For example:

```shell
bgzip -c input.vcf > input.vcf.gz
bcftools index -t input.vcf.gz
```

After harmonization, you should have per-chromosome VCFs under your work directory:
```
/path/to/workdir/
    chr1/final.for_model.chr1.vcf.gz
    chr2/final.for_model.chr2.vcf.gz
    ...
```

These per-chromosome VCFs are in the exact SNP order expected by the bundle and can be passed directly to the inference runner.
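A small sketch for collecting the harmonized per-chromosome VCFs from the work directory, assuming the `chrN/final.for_model.chrN.vcf.gz` layout shown above (the helper name is ours):

```python
import re
from pathlib import Path

def collect_harmonized_vcfs(workdir):
    """Map chromosome name -> harmonized VCF path under workdir.

    Assumes the layout written by pipeline.py, shown above:
    workdir/chrN/final.for_model.chrN.vcf.gz
    """
    pattern = re.compile(r"final\.for_model\.(chr[0-9XY]+)\.vcf\.gz$")
    found = {}
    for path in sorted(Path(workdir).glob("chr*/final.for_model.*.vcf.gz")):
        match = pattern.search(path.name)
        if match:
            found[match.group(1)] = path
    return found
```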
If you have an arbitrary input VCF and want to run our pretrained models:
1. Choose a bundle: `pclai_1kg_cpu`, `pclai_1kg_cuda`, `pclai_1kg+hgdp_cpu`, or `pclai_1kg+hgdp_cuda`
2. Harmonize your input VCF with `pipeline.py` if needed
3. Collect the harmonized per-chromosome VCFs into a folder
4. Run `inference.py` on that folder
5. Save the outputs for downstream analysis and plotting
The inference.py module provides a command-line interface with two modes:
- `run-chrom`: run one chromosome VCF against the corresponding chromosome bundle
- `run-dir`: run a directory of chromosome VCFs
```shell
python3 inference.py run-chrom \
    --bundle-dir /path/to/pclai_bundle_cuda \
    --vcf-path /path/to/chr21.vcf.gz \
    --chrom 21 \
    --device cuda \
    --outdir /path/to/output_chr21
```

```shell
python3 inference.py run-dir \
    --bundle-dir /path/to/pclai_bundle_cuda \
    --vcf-dir /path/to/my_vcfs \
    --device cuda \
    --outdir /path/to/output_all
```

```shell
python3 inference.py run-dir \
    --bundle-dir /path/to/pclai_bundle_cuda \
    --vcf-dir /path/to/my_vcfs \
    --device cuda \
    --chroms 21,22 \
    --outdir /path/to/output_chr21_chr22
```

Important
Use --device cpu with a CPU bundle and --device cuda with a CUDA bundle.
For illustration purposes, a successful run ends like this:
```
[done] Summary table:
   chrom  subset_idx  n_required  n_matched  n_missing  match_fraction
0      1           1     1000000     807697     192303        0.807697
1      1           2     1000000     807937     192063        0.807937
2      1           3       45747      37401       8346        0.817562
[save] Writing outputs to: /path/to/output_dir
[save] results -> /path/to/output_dir/results.pkl.gz
[save] results_cp -> /path/to/output_dir/results_cp.pkl.gz
[save] stats -> /path/to/output_dir/stats.tsv
[save] metadata -> /path/to/output_dir/metadata.json
```

Note
`match_fraction` is the fraction of the SNPs required by the model that were found in your input VCF. Higher is better.
Caution
Any `match_fraction` below ~50% should be treated carefully, as most of the signal is being captured from imputed positions.
Important
If your `match_fraction` is well below ~50% using our pre-trained models, we recommend training a PCLAI model from scratch using your specific set of sites.
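The `match_fraction` column in the summary table is simply matched sites over required sites:

```python
def match_fraction(n_matched, n_required):
    """Fraction of the model-required SNPs found in the input VCF."""
    return n_matched / n_required

# First row of the example summary table above:
print(round(match_fraction(807697, 1_000_000), 6))  # 0.807697
```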
Each inference run writes:
```
output_dir/
    results.pkl.gz
    results_cp.pkl.gz
    stats.tsv
    metadata.json
```
- `results.pkl.gz`: nested dictionary with local ancestry coordinates
- `results_cp.pkl.gz`: nested dictionary with breakpoint logits
- `stats.tsv`: tabular summary of SNP matching / coverage for each chromosome or subset
- `metadata.json`: run configuration and paths
```python
from inference import load_inference_outputs

results, results_cp, stats_df, metadata = load_inference_outputs("/path/to/output_dir")
```

`results` is a nested dictionary with structure:
```
results[sample_id][chrom]["h1"] -> np.ndarray of shape (n_windows, 2)
results[sample_id][chrom]["h2"] -> np.ndarray of shape (n_windows, 2)
```

Example:

```python
results["ID2462"]["chr21"]["h1"]
```

This returns an array of shape `(n_windows, 2)`, where each row is the 2D coordinate predicted by the model for one window on haplotype 1.
- Column 0: first PCA coordinate
- Column 1: second PCA coordinate
`h2` is the analogous array for haplotype 2.
`results_cp` is a nested dictionary with structure:

```
results_cp[sample_id][chrom]["h1"] -> np.ndarray of shape (n_windows,)
results_cp[sample_id][chrom]["h2"] -> np.ndarray of shape (n_windows,)
```

Example:

```python
results_cp["ID2462"]["chr21"]["h1"]
```

This returns a 1D array of breakpoint logits, one per window.
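If you prefer probabilities over raw logits, the standard logistic sigmoid converts them. This is a generic transformation we apply for illustration, not something `results_cp` stores:

```python
import math

def logits_to_probs(logits):
    """Apply the logistic sigmoid elementwise: p = 1 / (1 + exp(-logit))."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

probs = logits_to_probs([-2.0, 0.0, 2.0])
# probs[1] is exactly 0.5; larger logits give values closer to 1.0
```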
The paintings.py script generates two complementary visualizations from saved inference outputs:
- Chromosome paintings: local ancestry coordinates painted along each chromosome
- PCA contour plots: density contours of all inferred windows in PCA space
These commands operate on the outputs previously written by inference.py, so you do not need the model bundle again at this stage.
Important
Before running paintings.py, make sure you already ran inference and saved:
- `results.pkl.gz`
- `results_cp.pkl.gz`
- `stats.tsv`
- `metadata.json`
inside a single output_dir.
Note
The painting CLI also needs access to the harmonized chromosome VCFs in order to recover genomic positions for plotting along the chromosomes. That is why --vcf-dir should point to the directory containing the per-chromosome harmonized VCFs.
Chromosome paintings color each genomic window according to its 2D PCLAI coordinate. The color mapping is derived from the PCA reference panel provided by:
- `--founders-tsv`
- `--pca-constructor`
Important
The --founders-tsv and --pca-constructor used for plotting should match the reference system used to interpret the model outputs. If you switch reference PCA spaces, the color mapping will change.
These plots are useful for seeing how ancestry coordinates vary along the genome and across haplotypes.
```shell
python3 paintings.py paint-chromosomes \
    --results-dir /path/to/output_dir/ \
    --vcf-dir /path/to/output_dir/ \
    --founders-tsv references/references/pca_1kg_reference_panel.tsv \
    --pca-constructor references/pca_1kg_constructor.pkl \
    --sample-id ID2462 \
    --width-per-chrom 0.9 \
    --min-figwidth 4 \
    --max-figwidth 18 \
    --outdir /path/to/output_dir/plots
```

To paint all samples, omit `--sample-id`:

```shell
python3 paintings.py paint-chromosomes \
    --results-dir /path/to/output_dir/ \
    --vcf-dir /path/to/output_dir/ \
    --founders-tsv references/references/pca_1kg_reference_panel.tsv \
    --pca-constructor references/pca_1kg_constructor.pkl \
    --width-per-chrom 0.9 \
    --min-figwidth 4 \
    --max-figwidth 18 \
    --outdir /path/to/output_dir/plots_all
```

Key parameters for chromosome paintings:
- `--sample-id`: plot a single sample; if omitted, generate plots for all samples in `results.pkl.gz`
- `--vcf-dir`: directory containing harmonized per-chromosome VCFs used to recover SNP positions
- `--width-per-chrom`: horizontal width contribution per chromosome; increase this if multi-chromosome plots feel cramped
- `--min-figwidth`: minimum figure width for cases with very few chromosomes
- `--max-figwidth`: maximum figure width for cases with many chromosomes
- `--window-size-snps`: number of SNPs per model window; this should match the windowing used during inference (our pre-trained models use `1000`)
- `--dpi`: output resolution
Tip
If you are plotting only one chromosome, a small --min-figwidth such as 4 or 5 usually looks better than a wide default figure.
Tip
If labels overlap when plotting many chromosomes, try reducing --width-per-chrom slightly or increasing --max-figwidth.
The chromosome painting command writes, for each sample:
```
plots/
    SAMPLEID.chromosome_painting.png
    SAMPLEID.chromosome_painting.pdf
```
PCA contour plots summarize the density of all windows for one sample in PCA space. These plots are useful for:
- Comparing samples visually
- Assessing how concentrated or diffuse the inferred windows are
- Filtering windows using breakpoint confidence before plotting
```shell
python3 paintings.py paint-pca \
    --results-dir /path/to/output_dir/ \
    --founders-tsv references/references/pca_1kg_reference_panel.tsv \
    --pca-constructor references/pca_1kg_constructor.pkl \
    --sample-id ID2462 \
    --hist-bins 64 \
    --kde-sigma 1.5 \
    --contour-levels 10 \
    --breakpoint-alpha 0.2 \
    --outdir /path/to/output_dir/plots_all
```

Key parameters for PCA contour plots:
- `--sample-id`: plot a single sample; if omitted, generate plots for all samples
- `--hist-bins`: number of 2D histogram bins before smoothing
  - lower values: smoother, coarser contours
  - higher values: finer, more detailed contours
- `--kde-sigma`: Gaussian smoothing strength
  - lower values: sharper contours
  - higher values: smoother contours
- `--contour-levels`: number of contour levels to draw
- `--breakpoint-alpha`: keep only windows with breakpoint score/probability less than or equal to this threshold
- `--weight-gamma`: controls how strongly low-breakpoint windows are emphasized when weighting densities
- `--dpi`, `--figwidth`, `--figheight`: output resolution and figure size
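The `--breakpoint-alpha` filter can be sketched as follows: a window's coordinate is kept only if its breakpoint score is at or below the threshold (function and variable names here are ours, for illustration):

```python
def filter_windows(coords, breakpoint_scores, alpha=0.2):
    """Keep window coordinates whose breakpoint score is <= alpha.

    coords: one (pc1, pc2) pair per window
    breakpoint_scores: one score per window, aligned with coords
    """
    return [c for c, s in zip(coords, breakpoint_scores) if s <= alpha]

coords = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
scores = [0.05, 0.50, 0.10]
stable = filter_windows(coords, scores, alpha=0.2)  # keeps windows 0 and 2
```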
Tip
A good starting point for PCA contour plots is:
```
--hist-bins 64
--kde-sigma 1.5
--contour-levels 10
```
Tip
If the contours look noisy, increase --kde-sigma or reduce --hist-bins.
Tip
If you want to focus on the most stable windows only, try --breakpoint-alpha 0.1 or --breakpoint-alpha 0.2.
The PCA contour command writes, for each sample:
```
plots_all/
    SAMPLEID.pca_contour.png
    SAMPLEID.pca_contour.pdf
```
Soon!
Soon!
Soon!
NOTICE: This software is available for use free of charge for academic research use only. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to that effect. Commercial users, for profit companies or consultants, and non-profit institutions not qualifying as "academic research" should contact geleta@berkeley.edu. This applies to this repository directly and any other repository that includes source, executables, or git commands that pull/clone this repository as part of its function. Such repositories, whether ours or others, must include this notice.
When using the PCLAI method or PCLAI outputs, please cite the following paper:
```bibtex
@article{geleta_pclai_2026,
  author  = {Geleta, Margarita and Mas Montserrat, Daniel and Ioannidis, Nilah M. and Ioannidis, Alexander G.},
  title   = {{Point cloud local ancestry inference (PCLAI): continuous coordinate-based ancestry along the genome}},
  year    = {2026},
  journal = {bioRxiv},
  doi     = {10.64898/2026.03.23.713813}
}
```
