InfUSER

About

InfUSER provides tools for the analysis of Hi-C data. It currently contains a single function, singletree, which infers the state of progenitor cells given a differentiation tree and data of the corresponding data type at the leaves. Several data types are supported, but all inputs of a single run must be of the same data type.

Quick Start

This quick start example uses the data and input files included in the test folder of the repository. The commands below take around 54 seconds to run.

CLI

infuser singletree -r 100000 -d 1000000 -nj 2 -b False\
 -s chr12_subset.txt\
 tree_file.tsv\
 samples_file.tsv\
 'output'\
 mm10.chrom.sizes\
 chr12

API

from infuser import single_tree
single_tree(tree_path = 'tree_file.tsv',
            sample_file = 'samples_file.tsv',
            output_dir = 'output',
            chrom_sizes = 'mm10.chrom.sizes',
            chromlist = ['chr12'],
            res = 100000,
            dist = 1000000,
            balance = False,
            subset = 'chr12_subset.txt',
            n_jobs = 2)

Cloning the repository

Use the following command to clone the repository:

git clone https://github.com/AudreyBaguette/InfUSER.git

Installing `InfUSER` and its dependencies

InfUSER v1.1.0 was built with Python 3.13. It relies on the following libraries:

numpy
pandas
treelib
joblib
cooler

Once those libraries have been installed, InfUSER can be installed from the cloned repo:

cd InfUSER
git checkout v1.1.0
pip install dist/infuser-1.1.0.tar.gz

or directly from GitHub:

pip install git+https://github.com/AudreyBaguette/InfUSER.git@v1.1.0

Usage

info

Prints the version of the package and its source (this Git page).

singletree

Run InfUSER with a single data type.

Required inputs

Tree topology file (API: tree_path, CLI: TREEPATH)
The tree topology file records the topology of the differentiation tree to use. The first row contains a single field, the name of the root. Each following row contains two fields, separated by a tab: the name of a new node, then the name of its parent. The tree is constructed from the root down, so a parent must be declared before its children. The tree does not have to be binary: a parent can have more than two children. Each node name must be unique.
Example:
root
n1 root
n2 root
l1 n1
l2 n1
l3 n1
l4 n2
l5 n2
Gives the following tree:
root
├── n1
│   ├── l1
│   ├── l2
│   └── l3
└── n2
    ├── l4
    └── l5
The file corresponding to the example above is provided at Examples/tree_file.tsv. A helper function is provided to convert a linkage ndarray to the proper format. See the linkage_to_tree section for more information.
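The rules above (root on the first row, parents declared before children, unique names) can be checked with a few lines of Python. The following is a minimal sketch using only the standard library; the function names `read_tree_topology` and `leaves_of` are hypothetical and are not part of the InfUSER API, which builds its tree internally.

```python
from collections import defaultdict


def read_tree_topology(lines):
    """Parse topology rows into (root, children), enforcing the rules above."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows or len(rows[0]) != 1:
        raise ValueError("the first row must contain only the root name")
    root = rows[0][0]
    children = defaultdict(list)  # parent name -> list of child names
    seen = {root}
    for fields in rows[1:]:
        if len(fields) != 2:
            raise ValueError(f"expected 'node<TAB>parent', got {fields!r}")
        node, parent = fields
        if node in seen:
            raise ValueError(f"duplicate node name: {node}")
        if parent not in seen:
            raise ValueError(f"parent {parent!r} must be declared before {node!r}")
        seen.add(node)
        children[parent].append(node)
    return root, dict(children)


def leaves_of(root, children):
    """Return the terminal nodes, i.e. the nodes that never appear as a parent."""
    all_nodes = [root] + [c for kids in children.values() for c in kids]
    return [n for n in all_nodes if n not in children]


# Same content as the example file above, with tab-separated fields.
example = ["root", "n1\troot", "n2\troot", "l1\tn1", "l2\tn1",
           "l3\tn1", "l4\tn2", "l5\tn2"]
root, children = read_tree_topology(example)
print(root, children)
print(leaves_of(root, children))  # ['l1', 'l2', 'l3', 'l4', 'l5']
```

The leaves returned here are the nodes that need an entry in the samples file.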
Samples description file (API: sample_file, CLI: SAMPLEFILE)
The samples file contains two fields, separated by a tab. The first field is the name of the sample. Sample names must match the names used in the topology file. The second field is the path to the corresponding data file (.mcool or .bed). Every leaf (terminal node) in the topology file must be listed in the samples file with a valid path. Samples that are listed in the samples file but absent from the topology are ignored.
- For Hi-C:
  The samples file corresponding to the tree topology file above is provided at Examples/HiC_samples_file.tsv.
- For other data types:
  The data files need to have the same number of rows, in the same order. The samples file corresponding to the tree topology file above is provided at Examples/ChIP_samples_file.tsv.
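The relationship between the samples file and the tree leaves can be illustrated as follows. This is a sketch under stated assumptions: `read_samples` and `match_samples_to_leaves` are hypothetical helpers, the paths are made up for the example, and the check that each path exists on disk is left out.

```python
def read_samples(lines):
    """Map sample name -> data file path from 'name<TAB>path' rows."""
    samples = {}
    for line in lines:
        if not line.strip():
            continue
        name, path = line.rstrip("\n").split("\t")
        samples[name] = path
    return samples


def match_samples_to_leaves(samples, leaves):
    """Keep one path per leaf; fail if a leaf has no sample, ignore extras."""
    missing = [leaf for leaf in leaves if leaf not in samples]
    if missing:
        raise ValueError(f"leaves without a sample entry: {missing}")
    return {leaf: samples[leaf] for leaf in leaves}


# Hypothetical paths, for illustration only.
rows = ["l1\tdata/l1.mcool", "l2\tdata/l2.mcool", "extra\tdata/extra.mcool"]
samples = read_samples(rows)
print(match_samples_to_leaves(samples, ["l1", "l2"]))
# {'l1': 'data/l1.mcool', 'l2': 'data/l2.mcool'}
```

The sample named `extra` is dropped silently, mirroring the behaviour described above for samples absent from the topology.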
Output directory (API: output_dir, CLI: OUTDIR)
The path to the output directory (see the Outputs section for a description of the created files and the file structure).
File of chromosome sizes (API: chrom_sizes, CLI: CHROMSIZES)
The path to the file containing the size (in bp) of each chromosome.
Chromosomes to process (API: chromlist, CLI: CHROMLIST)
The names of the chromosomes to consider. This list is ignored if a subset file is provided.

Optional inputs

Resolution (API: res, CLI: -r/--resolution)
The resolution to consider. The input files need to have been generated with that resolution. (Hi-C only, default 10000)
Subset file (API: subset, CLI: -s/--subset)
The subset file is only used when the input data is Hi-C. It must contain three tab-separated columns: the chromosome name, the start of the region to consider and the end of the region to consider. The file must contain a header with the following names:
seqnames start end
Notes: the start and end coordinates are rounded down to the nearest bin, relative to the resolution. The end bin is excluded. If the start and end region fall within the exact same bin, the region is considered too small and is ignored. An example is provided at Examples/subset_file.tsv.
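The rounding described in the notes can be made concrete with a small worked example. The sketch below assumes integer floor division for "rounded down to the nearest bin"; `region_to_bins` and `read_subset` are hypothetical names, not InfUSER functions.

```python
import csv
import io


def region_to_bins(start, end, res):
    """Round start and end down to bin indices; the end bin is excluded."""
    start_bin = start // res
    end_bin = end // res
    if start_bin == end_bin:
        return None  # region falls within a single bin: too small, ignored
    return start_bin, end_bin


def read_subset(handle, res):
    """Read a subset file with the header 'seqnames start end' (tab-separated)."""
    regions = []
    for row in csv.DictReader(handle, delimiter="\t"):
        bins = region_to_bins(int(row["start"]), int(row["end"]), res)
        if bins is not None:
            regions.append((row["seqnames"], bins[0], bins[1]))
    return regions


subset_text = (
    "seqnames\tstart\tend\n"
    "chr12\t1250000\t3780000\n"   # bins 12 to 37 (37 excluded) at 100 kb
    "chr12\t5010000\t5090000\n"   # start and end both in bin 50: ignored
)
print(read_subset(io.StringIO(subset_text), res=100000))
# [('chr12', 12, 37)]
```

At a resolution of 100000, the first region therefore covers 1200000 to 3700000 bp after rounding.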
Maximal distance to consider (API: dist, CLI: -d/--dist)
The distance to consider. All interactions beyond that distance will be ignored. If set to 0, all interactions are kept. (Hi-C only, default 0)
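As an illustration of the distance cutoff, the sketch below decides whether a pixel between two bins is kept. It assumes that the distance between two bins is their index difference times the resolution, and that a pixel exactly at the cutoff is kept; both points are assumptions, and `within_distance` is a hypothetical name.

```python
def within_distance(bin1, bin2, res, dist):
    """Return True if the pixel (bin1, bin2) is kept for a maximal distance dist."""
    if dist == 0:
        return True  # 0 means no cutoff: all interactions are kept
    return abs(bin2 - bin1) * res <= dist


# With the quick start settings (res = 100000, dist = 1000000):
print(within_distance(0, 10, 100000, 1000000))  # True, exactly 1 Mb apart
print(within_distance(0, 11, 100000, 1000000))  # False, 1.1 Mb apart
print(within_distance(0, 500, 100000, 0))       # True, no cutoff
```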
Column index (API: column, CLI: -c/--column)
The column containing the score to consider. The first column is column 1. (1D data only, default 4)
Transformation (API: transform, CLI: -t/--transform)
The transformation(s) to apply to the matrix. "log1p" and "Z-score" are supported. (default Z-score)
Balancing (API: balance, CLI: -b/--balance)
Whether the balanced weights should be used. (Hi-C only, default True)
Number of jobs for parallelization (API: n_jobs, CLI: -nj/--njobs)
The number of jobs to run in parallel for the pixel computation. (default 4)
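For readers unfamiliar with the two transformations accepted by the transform option, here is a minimal standard-library sketch of what "log1p" and "Z-score" mean in general. How InfUSER groups values before computing the mean and standard deviation is not documented here, so the use of the population standard deviation over a flat list is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def log1p_transform(values):
    """Apply log(1 + x) to each value; zero stays zero."""
    return [math.log1p(v) for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant input: no variation to scale
    return [(v - mean) / sd for v in values]


counts = [0, 3, 12, 45]
print(log1p_transform(counts))
print(z_score(counts))
```

After a Z-score transformation the values have mean 0 and unit standard deviation, which makes samples with different sequencing depths comparable.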

Outputs

The output folder will contain one file and four sub-folders:

tree_structure.txt
This file records the tree topology constructed from the input file.
internal_nodes
This folder contains the inferred data for the internal nodes of the differentiation tree. It contains one sub-folder per node. The names will match the names in the input files.
- For Hi-C:
  Each sample sub-folder will contain one file per (computed) chromosome and one .cool file merging them all.
- For other data types:
  Each sample sub-folder will contain one tsv file. It contains a single unnamed column holding the inferred values, in the same order as the input file.
leaves
This folder contains the transformed data for the leaves (terminal nodes) of the differentiation tree. It contains one sub-folder per node. The names will match the names in the input files.
- For Hi-C:
  Each sample sub-folder will contain one file per (computed) chromosome and one .cool file merging them all. The .cool files differ from the input files in that they are limited to one resolution, may not contain genome-wide data (depending on the chromlist and subset_file), and their contact values went through the same Z-transformation, discretization and de-transformation as the internal nodes.
- For other data types:
  Each sample sub-folder will contain one tsv file. It contains a single unnamed column holding the values, in the same order as the input file. The .tsv files differ from the input files in that their values went through the same Z-transformation, discretization and de-transformation as the internal nodes.
edges
This folder contains the changes between each parent-child pair. Those changes are computed as differences between Z-scores, before de-transformation. The folder contains one sub-folder per edge in the topology. The names will match the names in the input files and record the direction that was considered (from parent to child).
- For Hi-C:
  Each sub-folder will contain one file per chromosome and one .cool file merging them all. The files do not contain a contact frequency, but differences in contact frequencies.
- For other data types:
  Each sub-folder will contain one tsv file. The files do not contain a signal value, but differences in signal values.
parsimony
- parsimony_scores.cool The final parsimony scores, for each pixel, saved as a matrix in the .cool format.
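After a run, the top-level layout described above can be checked with a short script. This is a sketch only: `missing_outputs` and `node_folders` are hypothetical helpers, and the names of the files inside each node or edge sub-folder are deliberately not assumed.

```python
from pathlib import Path

# Top-level entries listed in the Outputs section.
EXPECTED_FILES = ["tree_structure.txt"]
EXPECTED_DIRS = ["internal_nodes", "leaves", "edges", "parsimony"]


def missing_outputs(output_dir):
    """Return the expected top-level entries that are absent from output_dir."""
    out = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (out / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (out / name).is_dir()]
    return missing


def node_folders(output_dir, kind):
    """List the sub-folder names under internal_nodes, leaves or edges."""
    base = Path(output_dir) / kind
    return sorted(p.name for p in base.iterdir() if p.is_dir())


# 'output' is the directory used in the quick start example.
print(missing_outputs("output"))
```

An empty list means that all five top-level entries were created.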

linkage_to_tree

This helper function converts a linkage matrix, as produced by SciPy, to a file of the correct input format. It is available through the API only.

Required inputs

linkage : ndarray The hierarchical clustering encoded as a linkage matrix. The exact format expected is the one produced by scipy.cluster.hierarchy.linkage
leaves : list of string The names of the leaves. The leaves are expected to be in the same order as in the linkage matrix (leaves[0] is the name of node 0, leaves[1] is the name of node 1, etc)
outfile : string The path where to save the tree file

Optional inputs

exclude : list of string The names of the leaves to exclude. If the list is empty, no leaf is excluded (default [])

Outputs

The function saves the tree in the correct format to the path specified as parameter.
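To show what such a conversion involves, the sketch below turns a linkage matrix into tree topology rows. It is not the InfUSER implementation: it relies only on the documented SciPy convention that row i of a linkage matrix over n leaves merges the clusters in its first two columns into a new cluster with index n + i. The function name, the `node<index>` naming of internal nodes and the absence of an exclude option are assumptions made for the example.

```python
def linkage_rows_to_tree_lines(linkage, leaves, prefix="node"):
    """Convert SciPy-style linkage rows to 'node<TAB>parent' rows, root first."""
    n = len(leaves)
    if len(linkage) != n - 1:
        raise ValueError("a linkage over n leaves must have n - 1 rows")

    def name(idx):
        idx = int(idx)
        # Indices below n are leaves; others are merged clusters (hypothetical names).
        return leaves[idx] if idx < n else f"{prefix}{idx}"

    root = n + len(linkage) - 1  # the last merge is the root
    lines = [name(root)]
    # Walk the merges from last to first so every parent precedes its children.
    for i in range(len(linkage) - 1, -1, -1):
        parent = name(n + i)
        lines.append(f"{name(linkage[i][0])}\t{parent}")
        lines.append(f"{name(linkage[i][1])}\t{parent}")
    return lines


# Rows follow the SciPy layout: [cluster 1, cluster 2, distance, size].
toy_linkage = [[0, 1, 0.5, 2], [2, 3, 1.0, 3]]
for line in linkage_rows_to_tree_lines(toy_linkage, ["a", "b", "c"]):
    print(line)
```

Because a linkage matrix always merges two clusters at a time, the resulting tree is binary, which is a valid special case of the topology format.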

Runtime

The following figure represents the necessary runtime across mouse autosomes at different resolutions. InfUSER was run on 8 nodes, using 150 GB of RAM each (note that the runs at lower resolutions required fewer resources). For the 5kb, 10kb and 50kb resolutions, a maximal distance of 3Mb was used. The runs at 100kb, 500kb and 1Mb did not have a maximal distance.
[Figure: InfUSER runtime]

Contributors

Audrey Baguette

References

Bonev B, Mendelson Cohen N, Szabo Q, Fritsch L et al. Multiscale 3D Genome Rewiring during Mouse Neural Development. Cell 2017 Oct 19;171(3):557-572.e24. PMID: 29053968
Zhang, Y., Blanchette, M. Reference panel guided topological structure annotation of Hi-C data. Nat Commun 13, 7426 (2022). https://doi.org/10.1038/s41467-022-35231-3

Citing `InfUSER_single_tree`

[TODO]
