Gene Co-Expression Network Analysis in Python

Final project for the course "Python for Bioinformatics" (Instructor: Prof. Gilderlânio Santana), completed during my Ph.D.

About This Project

The goal of this project was to develop a fully self-contained Jupyter Notebook that applies core Python programming and bioinformatics concepts to a real biological question: can we identify groups of co-expressed genes and link them to biological functions using network analysis?

The notebook walks through the entire workflow -- from reading a raw count matrix to annotating functional modules -- using only Python libraries. It demonstrates skills in data wrangling (pandas), graph theory and network analysis (NetworkX), community detection algorithms (Louvain), API-based gene annotation (MyGene.info), and functional enrichment analysis (gProfiler), as well as data visualization with matplotlib and seaborn.

Overview

Starting from an RNA-seq gene expression count matrix (42,114 genes x 70 samples), we build a gene co-expression network by computing pairwise Spearman correlations and applying a hard threshold (|r| > 0.9). We then analyze the network topology, identify hub genes, detect co-expression modules, and annotate the hub modules with biological functions from the GO, KEGG, and Reactome databases.
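As a rough illustration of the correlation-and-threshold step, the sketch below computes Spearman correlations between genes with pandas and converts them into a binary adjacency matrix. The function name `binary_adjacency`, the toy expression table, and the genes-in-rows orientation are assumptions made for this example; this is not code taken from the notebook.

```python
import numpy as np
import pandas as pd


def binary_adjacency(expr: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return a 0/1 gene-by-gene adjacency matrix.

    `expr` is assumed to hold genes in rows and samples in columns.
    """
    # DataFrame.corr() correlates columns, so transpose to put genes in columns.
    corr = expr.T.corr(method="spearman")
    # Thresholding |r| keeps both strong positive and strong negative pairs.
    mask = (corr.abs() > threshold).to_numpy().copy()
    # A gene always correlates perfectly with itself; drop those self-edges.
    np.fill_diagonal(mask, False)
    return pd.DataFrame(mask.astype(int), index=corr.index, columns=corr.columns)


# Toy data (hypothetical): g2 rises with g1, g3 falls with g1, g4 is unrelated.
toy = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],
        [2, 4, 6, 8, 11],
        [5, 4, 3, 2, 1],
        [2, 5, 1, 4, 3],
    ],
    index=["g1", "g2", "g3", "g4"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)
adj = binary_adjacency(toy, threshold=0.9)
print(adj)
```

In this toy table g1, g2, and g3 end up fully connected to one another, while g4 has no edges.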

Pipeline

| Step | Description | Key Methods |
| --- | --- | --- |
| 1. Data Preprocessing | Read count matrix, filter low-expression genes, z-score normalization, variance-based gene selection | pandas, custom `filter_genes()` function |
| 2. Co-Expression Calculation | Spearman rank correlation matrix with hard thresholding (\|r\| > 0.9) to build a binary adjacency matrix | pandas `.corr()` |
| 3. Network Construction | Build an undirected graph; compute degree, clustering coefficient, and betweenness centrality | NetworkX |
| 4. Gene ID Mapping | Convert Ensembl IDs to standard gene symbols via REST API | mygene (MyGene.info) |
| 5. Hub Gene Identification | Select the top 10% of genes by degree as hub genes | Percentile-based thresholding |
| 6. Community Detection | Detect co-expression modules using the Louvain algorithm | python-louvain |
| 7. Functional Annotation | Enrichment analysis (GO, KEGG, Reactome) for each hub module | gprofiler-official |
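Steps 3 and 5 of the pipeline can be sketched roughly as follows with NetworkX. The helper names (`build_network`, `node_metrics`, `select_hubs`) and the toy adjacency matrix are hypothetical. Dropping genes with no edges is also an assumption about how the notebook moves from the filtered gene set to the final network.

```python
import networkx as nx
import pandas as pd


def build_network(adj: pd.DataFrame) -> nx.Graph:
    """Build an undirected graph from a symmetric 0/1 adjacency matrix."""
    graph = nx.from_pandas_adjacency(adj)
    # Assumption: genes without any edge above the threshold are discarded.
    graph.remove_nodes_from(list(nx.isolates(graph)))
    return graph


def node_metrics(graph: nx.Graph) -> pd.DataFrame:
    """Collect degree, clustering coefficient, and betweenness per gene."""
    return pd.DataFrame(
        {
            "degree": dict(graph.degree()),
            "clustering": nx.clustering(graph),
            "betweenness": nx.betweenness_centrality(graph),
        }
    )


def select_hubs(metrics: pd.DataFrame, top_fraction: float = 0.10) -> list:
    """Return genes whose degree is at or above the (1 - top_fraction) quantile."""
    cutoff = metrics["degree"].quantile(1 - top_fraction)
    # Ties at the cutoff can return slightly more than top_fraction of the genes.
    return metrics.index[metrics["degree"] >= cutoff].tolist()


# Toy adjacency (hypothetical): one gene "h" linked to nine others, plus an
# unconnected gene "iso".
genes = ["h"] + [f"n{i}" for i in range(1, 10)] + ["iso"]
toy_adj = pd.DataFrame(0, index=genes, columns=genes)
for leaf in genes[1:10]:
    toy_adj.loc["h", leaf] = 1
    toy_adj.loc[leaf, "h"] = 1

graph = build_network(toy_adj)
metrics = node_metrics(graph)
hubs = select_hubs(metrics, top_fraction=0.10)
print(metrics.sort_values("degree", ascending=False).head())
print("Hub genes:", hubs)
```

A quantile cutoff is only one way to express "top 10% by degree"; ranking the genes and taking a fixed number would behave differently when many genes share the same degree.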

Key Results

42,114 -> 16,025 -> 1,270 genes: successive filtering by expression (>= 10 counts in at least 90% of samples) and by variance (top 25%) reduced the gene set to a computationally tractable size while retaining the most variable genes
550 nodes, 3,174 edges: the co-expression network at Spearman |r| > 0.9 captured only the strongest co-expression relationships
Scale-free-like topology: the right-skewed degree distribution and high clustering coefficients are consistent with topologies commonly reported for biological networks, which supports the network construction approach
20 modules (full network), 4 modules (hub subgraph): Louvain community detection identified distinct groups of co-expressed genes at both scales
Functional enrichment: gProfiler annotation of the 4 hub modules revealed distinct biological processes per module, linking network structure to biological function
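The two-scale module detection mentioned above can be illustrated with a small sketch. The notebook uses python-louvain; to keep this example dependent on NetworkX alone, it swaps in `louvain_communities` from NetworkX (available from version 2.8). That function implements the same Louvain method but is not the call the notebook makes. The helper names and the toy graph are hypothetical.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities


def detect_modules(graph: nx.Graph, seed: int = 0) -> dict:
    """Map each gene to a module id found by Louvain modularity optimization."""
    communities = louvain_communities(graph, seed=seed)
    return {gene: i for i, members in enumerate(communities) for gene in members}


def hub_subgraph(graph: nx.Graph, top_fraction: float = 0.10) -> nx.Graph:
    """Return the subgraph induced by the highest-degree genes."""
    degrees = dict(graph.degree())
    cutoff = np.quantile(list(degrees.values()), 1 - top_fraction)
    hubs = [gene for gene, degree in degrees.items() if degree >= cutoff]
    return graph.subgraph(hubs).copy()


# Toy network (hypothetical): two 5-gene cliques joined by a single edge.
toy_graph = nx.barbell_graph(5, 0)

# Scale 1: modules in the full network.
full_modules = detect_modules(toy_graph)

# Scale 2: modules among the hub genes only.
hubs_only = hub_subgraph(toy_graph, top_fraction=0.10)
hub_modules = detect_modules(hubs_only)

print("Modules in full network:", len(set(full_modules.values())))
print("Genes in hub subgraph:", sorted(hubs_only.nodes()))
```

Louvain is a randomized heuristic, so fixing `seed` keeps the module assignment reproducible between runs.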

Code Highlights

filter_genes(): a reusable function with configurable min_count and min_prop parameters for flexible gene filtering
Spearman correlation with hard thresholding: chosen over Pearson because gene expression data often violate normality assumptions, and a rank-based measure is less sensitive to outliers and skew; thresholding the absolute value captures both positive and negative co-expression
Two-scale analysis: community detection on both the full network (global view) and the hub subgraph (highly connected genes, which are candidate regulators) provides complementary perspectives
Ensembl-to-symbol mapping via MyGene.info API: makes results biologically interpretable
gProfiler enrichment: queries multiple databases (GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Reactome) in a single call
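The notebook's own `filter_genes()` is not reproduced here. The following is a minimal sketch of how a function with `min_count` and `min_prop` parameters might look, assuming genes in rows and samples in columns. The companion helper `top_variance_genes` and both toy tables are hypothetical additions for illustration.

```python
import pandas as pd


def filter_genes(counts: pd.DataFrame, min_count: int = 10, min_prop: float = 0.9) -> pd.DataFrame:
    """Keep genes with at least `min_count` reads in at least `min_prop` of samples."""
    # Fraction of samples in which each gene reaches the count threshold.
    prop_expressed = (counts >= min_count).mean(axis=1)
    return counts.loc[prop_expressed >= min_prop]


def top_variance_genes(expr: pd.DataFrame, fraction: float = 0.25) -> pd.DataFrame:
    """Keep the genes whose across-sample variance is in the top `fraction`."""
    variances = expr.var(axis=1)
    cutoff = variances.quantile(1 - fraction)
    return expr.loc[variances >= cutoff]


# Toy count table (hypothetical): 4 genes x 10 samples.
samples = [f"s{i}" for i in range(1, 11)]
toy_counts = pd.DataFrame(
    [
        [50] * 10,           # expressed in every sample
        [20] * 9 + [0],      # expressed in exactly 90% of samples
        [20] * 8 + [0, 0],   # expressed in only 80% of samples
        [0] * 10,            # never expressed
    ],
    index=["high", "borderline", "low", "zero"],
    columns=samples,
)
kept = filter_genes(toy_counts, min_count=10, min_prop=0.9)
print(kept.index.tolist())

# Toy expression table (hypothetical) with per-gene variances 0, 1, 4, and 100.
toy_expr = pd.DataFrame(
    [[1, 1, 1], [1, 2, 3], [2, 4, 6], [0, 10, 20]],
    index=["a", "b", "c", "d"],
    columns=["s1", "s2", "s3"],
)
variable = top_variance_genes(toy_expr, fraction=0.25)
print(variable.index.tolist())
```

Exposing both thresholds as parameters makes it easy to rerun the filter with a stricter or looser definition of "expressed" without editing the function body.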

Tools & Libraries

| Library | Purpose |
| --- | --- |
| pandas | Data manipulation, filtering, and normalization |
| NetworkX | Graph construction, network metrics (degree, clustering, betweenness) |
| python-louvain | Community detection (Louvain algorithm for modularity optimization) |
| seaborn / matplotlib | Visualization (histograms, scatter plots, network graphs) |
| mygene | Gene ID mapping (Ensembl -> gene symbol) via MyGene.info REST API |
| gprofiler-official | Functional enrichment analysis (GO, KEGG, Reactome) |

How to Run

# Install dependencies
pip install notebook pandas networkx python-louvain seaborn matplotlib mygene gprofiler-official

# Launch the notebook
jupyter notebook gene_coexpression_network_analysis.ipynb

The notebook expects a tab-separated count matrix file named GeneExpressionData.counts in the working directory.
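For orientation, a tab-separated count matrix of this kind could be read with pandas roughly as shown below. The assumption that gene IDs sit in the first column, the helper name `load_counts`, and the placeholder gene and sample names are illustrative and are not taken from the notebook.

```python
import io

import pandas as pd


def load_counts(path_or_buffer) -> pd.DataFrame:
    """Read a tab-separated count matrix with genes in rows and samples in columns."""
    # index_col=0 assumes the first column holds the gene identifiers.
    return pd.read_csv(path_or_buffer, sep="\t", index_col=0)


# In-memory stand-in for the real file, with placeholder IDs and sample names.
example_file = io.StringIO(
    "gene_id\tS1\tS2\n"
    "ENSG_EXAMPLE_1\t12\t0\n"
    "ENSG_EXAMPLE_2\t7\t30\n"
)
counts = load_counts(example_file)
print(counts.shape)

# With the real data in the working directory, the call would be:
# counts = load_counts("GeneExpressionData.counts")
```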

Author

Juliana Pinto -- Ph.D. student in Genetics and Molecular Biology

bioinfo_python_course

Repository for the final project of the Python for Bioinformatics course of the Graduate Program in Genetics and Molecular Biology at the Federal University of Pará.
