InfUSER provides tools for the analysis of Hi-C data. It currently contains a single function, singletree, which infers the state of progenitor cells given a differentiation tree and data of the corresponding data type at the leaves. It accepts multiple data types as input, but all inputs of a single run must be of the same data type.
This quickstart example uses the data and input files included in the test folder of the repository. The following takes around 54 seconds to run.
infuser singletree -r 100000 -d 1000000 -nj 2 -b False \
    -s chr12_subset.txt \
    tree_file.tsv \
    samples_file.tsv \
    'output' \
    mm10.chrom.sizes \
    chr12
from infuser import single_tree

single_tree(
    tree_path='tree_file.tsv',
    sample_file='samples_file.tsv',
    output_dir='output',
    chrom_sizes='mm10.chrom.sizes',
    chromlist=['chr12'],
    res=100000,
    dist=1000000,
    balance=False,
    subset='chr12_subset.txt',
    n_jobs=2,
)
Use the following command to clone the repository:
git clone https://github.com/AudreyBaguette/InfUSER.git
InfUSER v1.1.0 has been built with Python 3.13. It relies on the following libraries:
- numpy
- pandas
- treelib
- joblib
- cooler
Once those libraries have been installed, InfUSER can be installed from the cloned repo:
cd InfUSER
git checkout v1.1.0
pip install dist/infuser-1.1.0.tar.gz
or directly from GitHub:
pip install git+https://github.com/AudreyBaguette/InfUSER.git@v1.1.0
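Before installing, it can help to confirm that the libraries listed above are importable in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of InfUSER, and it assumes each library is imported under the same name as it is listed.

```python
from importlib.util import find_spec

# Libraries listed in the installation notes; import names are assumed
# to be identical to the listed package names.
REQUIRED = ["numpy", "pandas", "treelib", "joblib", "cooler"]


def missing_packages(names=REQUIRED):
    """Return the top-level packages in `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All listed libraries are importable.")
```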
Prints the version of the package and its source (this Git page).
Run InfUSER with a single data type.
- Tree topology file (API: tree_path, CLI: TREEPATH)
The tree topology file records the topology of the differentiation tree to use. The first row contains only one field, the name of the root. The following rows contain two fields, separated by tabs. The first one is the name of a new node, the second is the name of its parent. The tree is constructed from the root down, so each parent must be declared before its children. The tree does not have to be binary: a parent can have more than two children. Each node name must be unique.
Example:
root
n1 root
n2 root
l1 n1
l2 n1
l3 n1
l4 n2
l5 n2
Gives the following tree:
root
├── n1
│ ├── l1
│ ├── l2
│ └── l3
└── n2
├── l4
└── l5
The file corresponding to the example above is provided at Examples/tree_file.tsv. A helper function is provided to convert a linkage ndarray to the proper format. See section linkage_to_tree for more information.
- Samples description file (API: sample_file, CLI: SAMPLEFILE)
The samples file contains two fields, separated by tabs. The first field is the name of the sample. Sample names must match those in the topology file. The second field is the path to the corresponding data file (.mcool or .bed). All leaves (terminal nodes) present in the topology file must be present in the samples file with a valid path. Extra samples that are present in the samples file but not in the topology are ignored.
  - For Hi-C:
  The samples file corresponding to the tree topology file above is provided at Examples/HiC_samples_file.tsv.
  - For other data types:
  The data files need to have the same number of rows, in the same order. The samples file corresponding to the tree topology file above is provided at Examples/ChIP_samples_file.tsv.
- Output directory (API: output_dir, CLI: OUTDIR) The path to the output directory (see Outputs section for a description of the created files and the file structure)
- File of chromosome sizes (API: chrom_sizes, CLI: CHROMSIZES) The path to the file containing the size (in bp) of each chromosome
- Chromosomes to process (API: chromlist, CLI: CHROMLIST) The names of the chromosomes to consider. This list is ignored if subset is not null.
- Resolution (API: res, CLI: -r/--resolution) The resolution to consider. The input files need to have been generated with that resolution. (Hi-C only, default 10000)
- Subset file (API: subset, CLI: -s/--subset)
The subset file is only used when the input data is Hi-C. It must contain three columns, separated by tabs. The first column is the chromosome name. The second column is the start of the region to consider. The third column is the end of the region to consider. The file must contain a header, with the following names:
seqnames start end
Notes: the start and end coordinates are rounded down to the nearest bin, relative to the resolution. The end bin is excluded. If the start and end of a region fall within the same bin, the region is considered too small and is ignored. An example is provided at Examples/subset_file.tsv.
- Maximal distance to consider (API: dist, CLI: -d/--dist) The maximal distance to consider. All interactions beyond that distance are ignored. If set to 0, all interactions are kept (Hi-C only, default 0)
- Column index (API: column, CLI: -c/--column) The column containing the score to consider. The first column is column 1 (1D data only, default 4)
- Transformation (API: transform, CLI: -t/--transform) The transformation(s) to apply to the matrix. "log1p" and "Z-score" are supported. (default Z-score)
- Balancing (API: balance, CLI: -b/--balance) Whether the balanced weights should be used (Hi-C only, default True)
- Number of jobs for parallelization (API: n_jobs, CLI: -nj/--njobs) The number of jobs to run in parallel for the pixel computation (default 4)
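The file conventions described above can be checked before launching a run. The following is a minimal sketch, not InfUSER's own parser: the helpers `read_tree`, `leaf_names` and `subset_bins` are hypothetical names introduced for illustration, and `subset_bins` reflects one reading of the subset notes (floor division by the resolution, half-open bin range).

```python
import io


def read_tree(handle):
    """Parse a tree topology file; return (root, {node: parent})."""
    rows = [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]
    if not rows or len(rows[0]) != 1:
        raise ValueError("the first row must contain only the root name")
    root = rows[0][0]
    parent_of = {root: None}
    for fields in rows[1:]:
        if len(fields) != 2:
            raise ValueError(f"expected 'node<TAB>parent', got {fields!r}")
        node, parent = fields
        if node in parent_of:
            raise ValueError(f"duplicated node name: {node}")
        if parent not in parent_of:
            raise ValueError(f"parent {parent} must be declared before {node}")
        parent_of[node] = parent
    return root, parent_of


def leaf_names(parent_of):
    """Return the nodes that are never used as a parent (terminal nodes)."""
    parents = set(parent_of.values())
    return [node for node in parent_of if node not in parents]


def subset_bins(start, end, res):
    """Round a region down to bins; return (first, last) with last excluded.

    Returns None when both coordinates fall in the same bin (region ignored).
    """
    first, last = start // res, end // res
    if first == last:
        return None
    return first, last


# The example topology from the tree file description above.
example = "root\nn1\troot\nn2\troot\nl1\tn1\nl2\tn1\nl3\tn1\nl4\tn2\nl5\tn2\n"
root, parent_of = read_tree(io.StringIO(example))
print(leaf_names(parent_of))  # ['l1', 'l2', 'l3', 'l4', 'l5']
print(subset_bins(1_250_000, 3_980_000, 100_000))  # (12, 39)
print(subset_bins(5_010_000, 5_090_000, 100_000))  # None
```

The names returned by `leaf_names` are the ones that must appear in the samples file with a valid path.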
The output folder will contain one file and the following sub-folders:
- tree_structure.txt
This file records the tree topology constructed from the input file.
- internal_nodes
This folder contains the inferred data for the internal nodes of the differentiation tree. It contains one sub-folder per node. The names will match the names in the input files.
  - For Hi-C:
  Each sample sub-folder will contain one file per (computed) chromosome and one .cool file merging them all.
  - For other data types:
  Each sample sub-folder will contain one tsv file. This only contains one unnamed column, containing the inferred values in the same order as the input file.
- leaves
This folder contains the transformed data for the leaves (terminal nodes) of the differentiation tree. It contains one sub-folder per node. The names will match the names in the input files.
  - For Hi-C:
  Each sample sub-folder will contain one file per (computed) chromosome and one .cool file merging them all. The .cool files are different from the input files in that they are limited to one resolution, may not contain genome-wide data (depending on the chromlist and subset_file), and the contact values went through the same Z-transformation, discretization and de-transformation as the internal nodes.
  - For other data types:
  Each sample sub-folder will contain one tsv file. This only contains one unnamed column, containing the transformed values in the same order as the input file. The .tsv files are different from the input files in that their signal values went through the same Z-transformation, discretization and de-transformation as the internal nodes.
- edges
This folder contains the changes between each parent-child pair. Those changes are computed as differences between Z-scores, before de-transformation. The folder contains one sub-folder per edge in the topology. The names will match the names in the input files and record the direction that was considered (from parent to child).
  - For Hi-C:
  Each sub-folder will contain one file per chromosome and one .cool file merging them all. The files do not contain a contact frequency, but differences in contact frequencies.
  - For other data types:
  Each sub-folder will contain one tsv file. The files do not contain a signal value, but differences in signal values.
- parsimony
  - parsimony_scores.cool The final parsimony scores, for each pixel, saved as a matrix in the .cool format.
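After a run, the top-level layout described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper `missing_outputs` is a hypothetical name, and it assumes every entry listed above is created for every run, which may not hold for all data types.

```python
import tempfile
from pathlib import Path

# Top-level entries named in the output description above.
EXPECTED = {
    "tree_structure.txt": "file",
    "internal_nodes": "dir",
    "leaves": "dir",
    "edges": "dir",
    "parsimony": "dir",
}


def missing_outputs(output_dir):
    """Return the expected top-level entries that are absent from output_dir."""
    base = Path(output_dir)
    missing = []
    for name, kind in EXPECTED.items():
        path = base / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing


# Demonstration on a throwaway directory holding only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tree_structure.txt").write_text("root\n")
    (Path(tmp) / "edges").mkdir()
    print(missing_outputs(tmp))  # ['internal_nodes', 'leaves', 'parsimony']
```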
This helper function converts a linkage object, as produced by scipy, to a file in the correct input format. It is available through the API only.
- linkage : ndarray The hierarchical clustering encoded as a linkage matrix. The exact format expected is the one produced by scipy.cluster.hierarchy.linkage
- leaves : list of string
The names of the leaves. The leaves are expected to be in the same order as in the linkage object (leaves[0] is the name of node 0, leaves[1] is the name of node 1, etc.)
- outfile : string The path where to save the tree file
- exclude : list of string The names of the leaves to exclude. If the list is empty, no leaf is excluded (default [])
The function saves the tree in the correct format to the path specified as parameter.
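To illustrate the kind of conversion involved, here is a minimal sketch that turns a scipy-style linkage matrix into rows of the tree topology format. It is not InfUSER's implementation: the function `linkage_to_rows`, the `node_<id>` naming of internal nodes and the `root_name` parameter are assumptions made for this example, and the `exclude` option is not handled. It relies only on the documented scipy convention that row i of a linkage over n leaves merges two clusters into cluster n + i.

```python
def linkage_to_rows(linkage, leaves, root_name="root"):
    """Convert a linkage matrix into tree-file rows, parents before children."""
    n = len(leaves)
    if n < 2 or len(linkage) != n - 1:
        raise ValueError("a linkage over n >= 2 leaves must have n - 1 rows")

    def name(cluster_id):
        cluster_id = int(cluster_id)
        if cluster_id < n:
            return leaves[cluster_id]      # original observation
        if cluster_id == 2 * n - 2:
            return root_name               # last merge is the root
        return f"node_{cluster_id}"        # hypothetical internal node name

    rows = [[root_name]]
    # Later merges sit closer to the root, so walking the rows backwards
    # guarantees that each parent is written before its children.
    for i in range(n - 2, -1, -1):
        parent = name(n + i)
        rows.append([name(linkage[i][0]), parent])
        rows.append([name(linkage[i][1]), parent])
    return rows


# Hand-written linkage over four leaves: (a, b) and (c, d) merge first,
# then the two resulting clusters merge into the root.
Z = [[0, 1, 1.0, 2], [2, 3, 1.5, 2], [4, 5, 3.0, 4]]
for row in linkage_to_rows(Z, ["a", "b", "c", "d"]):
    print("\t".join(row))
```

Because hierarchical clustering merges two clusters at a time, a tree obtained this way is always binary, even though the topology format itself allows more than two children per parent.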
The following figure shows the runtime across mouse autosomes at different resolutions. InfUSER was running on 8 nodes, using 150 GB of RAM each (note that the runs at lower resolutions required fewer resources). For the 5kb, 10kb and 50kb resolutions, a maximal distance of 3Mb was used. The runs at 100kb, 500kb and 1Mb did not have a maximal distance requested.
[Figure: InfUSER runtime]
- Audrey Baguette
- Bonev B, Mendelson Cohen N, Szabo Q, Fritsch L et al. Multiscale 3D Genome Rewiring during Mouse Neural Development. Cell 2017 Oct 19;171(3):557-572.e24. PMID: 29053968
- Zhang, Y., Blanchette, M. Reference panel guided topological structure annotation of Hi-C data. Nat Commun 13, 7426 (2022). https://doi.org/10.1038/s41467-022-35231-3
[TODO]