This is the code repository for https://www.biorxiv.org/content/10.1101/2024.06.16.599201v1.
Computational methods in biology can infer large molecular interaction networks from multiple data modalities and resolutions, creating unprecedented opportunities to better understand complex biological phenomena. Such graphs can be built from different conditions and get contrasted to uncover graph-level differences, e.g., a case-control study utilizing gene regulatory networks.
Towards this end, we introduce node2vec2rank, a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Unlike previous bag-of-features approaches, we leverage recent advances in machine learning and statistics to compare graphs in higher-order structures and in a data-driven manner. Node2vec2rank is computationally efficient and can provably identify the correct ranking of differences. Furthermore, we incorporate stability into n2v2r for an overall procedure that adheres to veridical data science principles by running it multiple times and aggregating the results. See figure below for a demonstration simulating a case-control study.
The method is not limited to comparing only two graphs. Given multiple graphs, there exist strategies such as one-versus-rest that compute pairwise comparisons based on vector arithmetic on the multi-layer node embeddings (see practicalities).
Note, that in our case, node2vec in the title does not correspond to the algorithm by Grover and Leskovec, but rather to any algorithm that can produce (multi-layer) node embeddings. An earlier version of node2vec2rank was using node2vec with transfer learning, but it was unstable and with many paramaters. UASE is based on SVD and is more stable, more efficient, has practically no parameters, and we were able to provide theoretical guarantees about the correct ranking.
While the method is motivated and validated with biological applications, it can be used in any other domain with similar objectives.
This repository provides the method, source code, and example notebooks. In particular, we provide the notebooks corresponding to the biological applications used in the paper, as well as a demo notebook for the general usage.
To get a local copy up and running follow steps below.
It is recommended to create a new virtual enviroment with conda.
- Clone the repo
git clone https://github.com/pmandros/node2vec2rank
- Change to the project repositry
cd node2vec2rank - Run the following command to create a conda enviroment named n2v2r with the necessary packages. Please note that it might take a few minutes for the environment to be created
conda env create --file environment.yaml
- Activate the environment
conda activate n2v2r
- Install node2vec2rank
pip install .
To run the node2vec2rank algorithm in command line, run the following command with a configuration file as input
python -m node2vec2rank.node2vec2rank --config configs/config_demo_adj_CLI.jsonThe configuration file template is as follows
{
"data_io": {
"save_dir": "<path_to_output>",
"data_dir": "<path_to_data>",
"graph_filenames": ["network_control.tsv","network_case.tsv"],
"seperator": "\t",
"is_edge_list": false,
"transpose": false
},
"data_preprocessing": {
"project_unipartite_on": null,
"threshold": null,
"top_percent_keep": 100,
"binarize": false,
"absolute": false
},
"fitting_ranking": {
"embed_dimensions": [4,6,8,10,12,14,16,18,20,22,24],
"distance_metrics": ["euclidean","cosine"],
"seed": null,
"verbose": 1
}
}The configuration parameters have the following functionality
data_io:
--save_dir SAVE_DIR Save directory
--graph_filenames [GRAPH_FILENAMES ...]
Graph filenames
--data_dir DATA_DIR Data Directory
--separator SEPARATOR
Separator used in the graph files
--is_edge_list Whether the input is an edge list or tabular
--transpose Whether to transpose the graph adjacency matrices or not if bipartite
data_preprocessing:
--project_unipartite_on PROJECT_UNIPARTITE_ON
If the graph adjacency matrices are non-square (i.e., bipartite), it will make them square by projecting into column or row space
--threshold THRESHOLD
Everything below this value will be 0
--top_percent_keep TOP_PERCENT_KEEP
Keeps the top percentage of edges, turning the rest to 0
--binarize Whether to binarize the graphs, turning everything above 0 to 1
--absolute Absolute the graphs, i.e., turn negative values into positive
fitting_ranking:
--embed_dimensions [EMBED_DIMENSIONS ...]
A list of all the embedding dimensions to use in n2v2r
--distance_metrics [DISTANCE_METRICS ...]
A list of all the distance metrics to use in n2v2r (currently supporting "cosine" and/or "Euclidean")
--seed SEED Random seed
--verbose VERBOSE Verbose levelYou can also run the code in jupyter notebook. Details about setting up your own workflow in jupyter notebook can be found in the notebooks provided. Check the demo notebook for general usage.
The input files can be either in adjacency format with index and header, or a weighted edge list (three columns of source target and edge weight) without header (the latter supported by networkx). At the moment, n2v2r accepts a list of symmetric dense numpy matrices as input, so the above input files will be converted by the Dataloader accordingly automatically. If your graphs are bipartite and in adjacency format (i.e., non-square matrices), they will be projected to unipartite with multiplication depending on the PROJECT_UNIPARTITE_ON parameter. As an example, if the graphs are in adjacency format with gene regulators in rows and genes in the columns, PROJECT_UNIPARTITE_ON = 'columns' will create symmetric networks by projecting the bipartite networks to gene space. If you want to work directly with bipartite graphs, they should be represented in edge list format and not adjacency. Note that internally the graphs are converted to scipy sparse format prior to the SVD.
The output (i.e., node rankings) and config file are written to disk in the folder specified in the config file with a timestamp attached. The node rankings are all dataframes but not sorted, rather the index follows the original node order as returned by the Dataloader.
Regarding the parameters embed_dimensions and distance_metrics, node2vec2rank runs multiple times for every parameter combination and then all the rankings are aggregated into one using the Borda scheme. The default parameter settings have been tested thoroughly. The data_preprocessing parameters are more involved. In a nutshell, they perform graph preprocessing such as binarization, sparsifying by keeping top edges, absoluting, thresholding, and projecting bipartite graphs to unipartite. We highly recommend performing your own preprocessing and using the resulting networks with n2v2r. Otherwise, check the network_transform function and the order of network transformations.
When the input is more than 2 graphs, there exists three different strategies to compare the graphs: one-vs-rest, sequential, one-vs-before. The last two imply some notion of ordering, e.g., one-vs-before implies strictly ordered graphs (e.g., longitudinal data). Node2vec2rank will perform pairwise comparisons depending on the strategies. The one-vs-rest will compare each network node embedding with the mean of the remaining network embeddings, producing for K graphs K rankings. The sequential strategy compares each graph node embedding with the next graph node embedding in the order, producing for K graphs K-1 rankings (we use this strategy in the cell cycle notebook). Lastly, the one-vs-before will compare each graph node embedding with the mean embedding of all previous graphs in the order, producing for K graphs K-1 rankings. Note that all strategies are equivalent for two graphs, but we recommend using the sequential strategy for two graphs (to better access the resulting rankings). So far we have not tested graphs with the one-vs-before strategy.
The post_utils class contains functions to perform over-representation analysis (ORA) and gene set enrichment analysis (GSEA) using GSEApy, as well as plotting the results using bubbleplots that are saved as pdfs and are publication ready. We also include gene set libraries such as KEGG and GOBP from MSigDB for your convenience.
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature) - Commit your Changes (
git commit -m 'Add some AmazingFeature') - Push to the Branch (
git push origin feature/AmazingFeature) - Open a Pull Request
Distributed under the GPL-3 License. See LICENSE for more information.
Panagiotis Mandros - pmandros[at]hsph[dot]harvard[dot]edu
Anis Ismail - anis[dot]ismail[at]student[dot]kuleuven[dot]be
