GitHub - pmandros/node2vec2rank: Node2vec2rank is an algorithm that can compare graphs to find their differences, e.g., in a case-control study that utilises gene regulatory networks.

node2vec2rank: Large Scale and Stable Graph Differential Analysis via Node Embeddings and Ranking

About The Project

This is the code repository for https://www.biorxiv.org/content/10.1101/2024.06.16.599201v1.

Computational methods in biology can infer large molecular interaction networks from multiple data modalities and resolutions, creating unprecedented opportunities to better understand complex biological phenomena. Such graphs can be built from different conditions and get contrasted to uncover graph-level differences, e.g., a case-control study utilizing gene regulatory networks.
Towards this end, we introduce node2vec2rank, a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Unlike previous bag-of-features approaches, we leverage recent advances in machine learning and statistics to compare graphs in higher-order structures and in a data-driven manner. Node2vec2rank is computationally efficient and can provably identify the correct ranking of differences. Furthermore, we incorporate stability into n2v2r for an overall procedure that adheres to veridical data science principles by running it multiple times and aggregating the results. See figure below for a demonstration simulating a case-control study.

The method is not limited to comparing only two graphs. Given multiple graphs, there exist strategies such as one-versus-rest that compute pairwise comparisons based on vector arithmetic on the multi-layer node embeddings (see practicalities).

Note, that in our case, node2vec in the title does not correspond to the algorithm by Grover and Leskovec, but rather to any algorithm that can produce (multi-layer) node embeddings. An earlier version of node2vec2rank was using node2vec with transfer learning, but it was unstable and with many paramaters. UASE is based on SVD and is more stable, more efficient, has practically no parameters, and we were able to provide theoretical guarantees about the correct ranking.

While the method is motivated and validated with biological applications, it can be used in any other domain with similar objectives.

This repository provides the method, source code, and example notebooks. In particular, we provide the notebooks corresponding to the biological applications used in the paper, as well as a demo notebook for the general usage.

Getting Started

To get a local copy up and running follow steps below.

Prerequisites

It is recommended to create a new virtual enviroment with conda.

Installation

Clone the repo

git clone https://github.com/pmandros/node2vec2rank

Change to the project repositry
```
cd node2vec2rank
```
Run the following command to create a conda enviroment named n2v2r with the necessary packages. Please note that it might take a few minutes for the environment to be created
```
conda env create --file environment.yaml
```
Activate the environment
```
conda activate n2v2r
```
Install node2vec2rank
```
pip install .
```

Usage

Running node2vec2rank in command line

To run the node2vec2rank algorithm in command line, run the following command with a configuration file as input

python -m node2vec2rank.node2vec2rank --config configs/config_demo_adj_CLI.json

The configuration file template is as follows

{
 "data_io": {
     "save_dir": "<path_to_output>",
     "data_dir": "<path_to_data>",
     "graph_filenames": ["network_control.tsv","network_case.tsv"],
     "seperator": "\t",
     "is_edge_list": false,
     "transpose": false
 },
 "data_preprocessing": {
     "project_unipartite_on": null,
     "threshold": null,
     "top_percent_keep": 100,
     "binarize": false,
     "absolute": false
 },
 "fitting_ranking": {
     "embed_dimensions": [4,6,8,10,12,14,16,18,20,22,24],
     "distance_metrics": ["euclidean","cosine"],
     "seed": null,
     "verbose": 1
 }
}

The configuration parameters have the following functionality

data_io:
  --save_dir SAVE_DIR   Save directory
  --graph_filenames [GRAPH_FILENAMES ...]
                        Graph filenames
  --data_dir DATA_DIR   Data Directory
  --separator SEPARATOR
                        Separator used in the graph files
  --is_edge_list        Whether the input is an edge list or tabular
  --transpose           Whether to transpose the graph adjacency matrices or not if bipartite

data_preprocessing:
  --project_unipartite_on PROJECT_UNIPARTITE_ON
                        If the graph adjacency matrices are non-square (i.e., bipartite), it will make them square by projecting into column or row space
  --threshold THRESHOLD
                        Everything below this value will be 0
  --top_percent_keep TOP_PERCENT_KEEP
                        Keeps the top percentage of edges, turning the rest to 0
  --binarize            Whether to binarize the graphs, turning everything above 0 to 1
  --absolute            Absolute the graphs, i.e., turn negative values into positive

fitting_ranking:
  --embed_dimensions [EMBED_DIMENSIONS ...]
                        A list of all the embedding dimensions to use in n2v2r 
  --distance_metrics [DISTANCE_METRICS ...]
                        A list of all the distance metrics to use in n2v2r (currently supporting "cosine" and/or "Euclidean")
  --seed SEED           Random seed
  --verbose VERBOSE     Verbose level

Running in a Jupyter Notebook Environment

You can also run the code in jupyter notebook. Details about setting up your own workflow in jupyter notebook can be found in the notebooks provided. Check the demo notebook for general usage.

Practicalities

The input files can be either in adjacency format with index and header, or a weighted edge list (three columns of source target and edge weight) without header (the latter supported by networkx). At the moment, n2v2r accepts a list of symmetric dense numpy matrices as input, so the above input files will be converted by the Dataloader accordingly automatically. If your graphs are bipartite and in adjacency format (i.e., non-square matrices), they will be projected to unipartite with multiplication depending on the PROJECT_UNIPARTITE_ON parameter. As an example, if the graphs are in adjacency format with gene regulators in rows and genes in the columns, PROJECT_UNIPARTITE_ON = 'columns' will create symmetric networks by projecting the bipartite networks to gene space. If you want to work directly with bipartite graphs, they should be represented in edge list format and not adjacency. Note that internally the graphs are converted to scipy sparse format prior to the SVD.

The output (i.e., node rankings) and config file are written to disk in the folder specified in the config file with a timestamp attached. The node rankings are all dataframes but not sorted, rather the index follows the original node order as returned by the Dataloader.

Regarding the parameters embed_dimensions and distance_metrics, node2vec2rank runs multiple times for every parameter combination and then all the rankings are aggregated into one using the Borda scheme. The default parameter settings have been tested thoroughly. The data_preprocessing parameters are more involved. In a nutshell, they perform graph preprocessing such as binarization, sparsifying by keeping top edges, absoluting, thresholding, and projecting bipartite graphs to unipartite. We highly recommend performing your own preprocessing and using the resulting networks with n2v2r. Otherwise, check the network_transform function and the order of network transformations.

When the input is more than 2 graphs, there exists three different strategies to compare the graphs: one-vs-rest, sequential, one-vs-before. The last two imply some notion of ordering, e.g., one-vs-before implies strictly ordered graphs (e.g., longitudinal data). Node2vec2rank will perform pairwise comparisons depending on the strategies. The one-vs-rest will compare each network node embedding with the mean of the remaining network embeddings, producing for K graphs K rankings. The sequential strategy compares each graph node embedding with the next graph node embedding in the order, producing for K graphs K-1 rankings (we use this strategy in the cell cycle notebook). Lastly, the one-vs-before will compare each graph node embedding with the mean embedding of all previous graphs in the order, producing for K graphs K-1 rankings. Note that all strategies are equivalent for two graphs, but we recommend using the sequential strategy for two graphs (to better access the resulting rankings). So far we have not tested graphs with the one-vs-before strategy.

The post_utils class contains functions to perform over-representation analysis (ORA) and gene set enrichment analysis (GSEA) using GSEApy, as well as plotting the results using bubbleplots that are saved as pdfs and are publication ready. We also include gene set libraries such as KEGG and GOBP from MSigDB for your convenience.

Contributing

Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the GPL-3 License. See LICENSE for more information.

Contact

Panagiotis Mandros - pmandros[at]hsph[dot]harvard[dot]edu
Anis Ismail - anis[dot]ismail[at]student[dot]kuleuven[dot]be

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

node2vec2rank: Large Scale and Stable Graph Differential Analysis via Node Embeddings and Ranking

Table of Contents

About The Project

Getting Started

Prerequisites

Installation

Usage

Running node2vec2rank in command line

Running in a Jupyter Notebook Environment

Practicalities

Contributing

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 240 Commits
configs		configs
data		data
input		input
node2vec2rank		node2vec2rank
notebooks		notebooks
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml
n2v2r.png		n2v2r.png
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

node2vec2rank: Large Scale and Stable Graph Differential Analysis via Node Embeddings and Ranking

Table of Contents

About The Project

Getting Started

Prerequisites

Installation

Usage

Running node2vec2rank in command line

Running in a Jupyter Notebook Environment

Practicalities

Contributing

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages