Gene Co-Expression Network Analysis in Python

Final project for the course "Python for Bioinformatics" (Instructor: Prof. Gilderlânio Santana), completed during my Ph.D.

About This Project

The goal of this project was to develop a fully self-contained Jupyter Notebook that applies core Python programming and bioinformatics concepts to a real biological question: can we identify groups of co-expressed genes and link them to biological functions using network analysis?

The notebook walks through the entire workflow -- from reading a raw count matrix to annotating functional modules -- using only Python libraries. It demonstrates skills in data wrangling (pandas), graph theory and network analysis (NetworkX), community detection algorithms (Louvain), API-based gene annotation (MyGene.info), and functional enrichment analysis (gProfiler), as well as data visualization with matplotlib and seaborn.

Overview

Starting from an RNA-seq gene expression count matrix (42,114 genes x 70 samples), we build a gene co-expression network by computing pairwise Spearman correlations and applying a hard threshold. We then analyze the network topology, identify hub genes, detect co-expression modules, and annotate the hub modules with biological functions from the GO, KEGG, and Reactome databases.
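The correlation-and-threshold step described above can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the function name `coexpression_adjacency` and the toy values are assumptions made up for the example; only the Spearman method and the 0.9 cutoff come from this README.

```python
import numpy as np
import pandas as pd


def coexpression_adjacency(expr: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Binary adjacency matrix from a genes x samples expression table."""
    # DataFrame.corr() correlates columns, so transpose to put genes on the columns.
    corr = expr.T.corr(method="spearman")
    # Hard threshold on |r|: keeps strong positive and strong negative correlations.
    adjacency = (corr.abs().to_numpy() > threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # a gene's correlation with itself is not an edge
    return pd.DataFrame(adjacency, index=corr.index, columns=corr.columns)


# Toy data (made up for illustration): 4 genes x 5 samples.
toy_expr = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],   # g1
        [2, 4, 6, 8, 11],  # g2: increases with g1
        [5, 4, 3, 2, 1],   # g3: decreases as g1 increases
        [3, 1, 4, 2, 5],   # g4: only weakly related to g1
    ],
    index=["g1", "g2", "g3", "g4"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)
adj = coexpression_adjacency(toy_expr)
print(adj)
```

In the toy table, g1, g2, and g3 end up connected to each other (g3 through a strong negative correlation), while g4 stays isolated.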

Pipeline

| Step | Description | Key Methods |
| --- | --- | --- |
| 1. Data Preprocessing | Read count matrix, filter low-expression genes, z-score normalization, variance-based gene selection | pandas, custom filter_genes() function |
| 2. Co-Expression Calculation | Spearman rank correlation matrix with hard thresholding (\|r\| > 0.9) to build a binary adjacency matrix | pandas .corr() |
| 3. Network Construction | Build an undirected graph; compute degree, clustering coefficient, and betweenness centrality | NetworkX |
| 4. Gene ID Mapping | Convert Ensembl IDs to standard gene symbols via REST API | mygene (MyGene.info) |
| 5. Hub Gene Identification | Select the top 10% of genes by degree as hub genes | Percentile-based thresholding |
| 6. Community Detection | Detect co-expression modules using the Louvain algorithm | python-louvain |
| 7. Functional Annotation | Enrichment analysis (GO, KEGG, Reactome) for each hub module | gprofiler-official |
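Steps 3 and 5 can be sketched with NetworkX as below. This is an illustrative sketch under assumptions, not the notebook's code: the helper names `network_metrics` and `hub_genes` and the toy graph are hypothetical, and the hub cutoff is implemented here as the 90th percentile of the degree distribution.

```python
import networkx as nx
import numpy as np


def network_metrics(graph: nx.Graph) -> dict:
    """Per-node degree, clustering coefficient, and betweenness centrality."""
    return {
        "degree": dict(graph.degree()),
        "clustering": nx.clustering(graph),
        "betweenness": nx.betweenness_centrality(graph),
    }


def hub_genes(graph: nx.Graph, top_fraction: float = 0.10) -> list:
    """Nodes whose degree is at or above the (1 - top_fraction) quantile."""
    degrees = dict(graph.degree())
    cutoff = np.quantile(list(degrees.values()), 1 - top_fraction)
    return [gene for gene, degree in degrees.items() if degree >= cutoff]


# Toy network (made up): one central node "A" linked to nine others, plus one extra edge.
toy_graph = nx.Graph()
toy_graph.add_edges_from([("A", leaf) for leaf in "BCDEFGHIJ"] + [("B", "C")])

metrics = network_metrics(toy_graph)
hubs = hub_genes(toy_graph)
print(hubs)
```

Because the cutoff uses `>=`, ties at the threshold can return slightly more than 10% of the nodes; a strict top-k selection would be an equally reasonable choice.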

Key Results

  • 42,114 -> 16,025 -> 1,270 genes: successive filtering by expression (>= 10 counts in at least 90% of samples) and variance (top 25%) reduced the gene set to a computationally tractable size while retaining the most informative genes
  • 550 nodes, 3,174 edges: the co-expression network at Spearman |r| > 0.9 captured only the strongest co-expression relationships
  • Scale-free-like topology: the right-skewed degree distribution and high clustering coefficients are characteristic of biological networks and are consistent with a sound network construction approach
  • 20 modules (full network), 4 modules (hub subgraph): Louvain community detection identified distinct groups of co-expressed genes at both scales
  • Functional enrichment: gProfiler annotation of the 4 hub modules revealed distinct biological processes per module, linking network structure to biological function
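The two-scale community detection reported above can be illustrated with a small sketch. Note the substitution: the project uses python-louvain, whereas this example uses NetworkX's built-in `louvain_communities` (available in NetworkX 2.8 and later) so that it needs only one dependency. The helper name `detect_modules`, the toy network, and the simplified hub rule (highest-degree nodes) are assumptions for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities


def detect_modules(graph: nx.Graph, seed: int = 0) -> list:
    """Louvain partition as a list of node sets, largest module first."""
    communities = louvain_communities(graph, seed=seed)
    return sorted(communities, key=len, reverse=True)


# Toy network (made up): two 5-node cliques joined by a single edge.
toy = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
toy.add_edge(0, 5)

# Scale 1: modules of the full network.
full_modules = detect_modules(toy)

# Scale 2: modules of the hub subgraph. Hubs here are simply the
# highest-degree nodes, a simplification of the top-10% rule.
max_degree = max(degree for _, degree in toy.degree())
hub_nodes = [node for node, degree in toy.degree() if degree == max_degree]
hub_modules = detect_modules(toy.subgraph(hub_nodes))

print(full_modules)
print(hub_modules)
```

Fixing `seed` makes the partition reproducible, which matters because Louvain visits nodes in random order and can return different modules between runs.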

Code Highlights

  • filter_genes(): a reusable function with configurable min_count and min_prop parameters for flexible gene filtering
  • Spearman correlation with hard thresholding: chosen over Pearson because it is rank-based, so it is less sensitive to outliers and to the non-normal distributions common in gene expression data; the absolute value threshold captures both positive and negative co-expression
  • Two-scale analysis: community detection on both the full network (global view) and the hub subgraph (focused on the most highly connected genes) provides complementary perspectives
  • Ensembl-to-symbol mapping via MyGene.info API: makes results biologically interpretable
  • gProfiler enrichment: queries multiple databases (GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Reactome) in a single call
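As a rough idea of what the filtering helper from the first highlight might look like, here is a minimal sketch. Only the name `filter_genes` and its `min_count` and `min_prop` parameters come from this README; the body, the companion `top_variance_genes` helper, and the toy counts are assumptions and may differ from the notebook.

```python
import pandas as pd


def filter_genes(counts: pd.DataFrame, min_count: int = 10, min_prop: float = 0.9) -> pd.DataFrame:
    """Keep genes (rows) with >= min_count reads in >= min_prop of samples (columns)."""
    prop_expressed = (counts >= min_count).mean(axis=1)
    return counts.loc[prop_expressed >= min_prop]


def top_variance_genes(counts: pd.DataFrame, fraction: float = 0.25) -> pd.DataFrame:
    """Hypothetical companion step: keep the given fraction of genes with highest variance."""
    variances = counts.var(axis=1)
    n_keep = max(1, int(len(variances) * fraction))
    return counts.loc[variances.nlargest(n_keep).index]


# Toy counts (made up): 4 genes x 10 samples.
samples = [f"s{i}" for i in range(1, 11)]
toy_counts = pd.DataFrame(
    [
        [20] * 10,            # A: well expressed everywhere
        [10] * 10,            # B: exactly at the count threshold
        [20] * 5 + [0] * 5,   # C: expressed in only half of the samples
        [0] * 10,             # D: never expressed
    ],
    index=["A", "B", "C", "D"],
    columns=samples,
)

filtered = filter_genes(toy_counts)
most_variable = top_variance_genes(toy_counts)
print(list(filtered.index), list(most_variable.index))
```

Exposing both thresholds as parameters lets the same function be reused with a stricter or more lenient expression filter without editing its body.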

Tools & Libraries

| Library | Purpose |
| --- | --- |
| pandas | Data manipulation, filtering, and normalization |
| NetworkX | Graph construction, network metrics (degree, clustering, betweenness) |
| python-louvain | Community detection (Louvain algorithm for modularity optimization) |
| seaborn / matplotlib | Visualization (histograms, scatter plots, network graphs) |
| mygene | Gene ID mapping (Ensembl -> gene symbol) via MyGene.info REST API |
| gprofiler-official | Functional enrichment analysis (GO, KEGG, Reactome) |
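The two annotation libraries could be called roughly as in the sketch below. The wrapper names (`symbols_from_hits`, `map_ensembl_to_symbol`, `enrich_module`) are hypothetical, and the species defaults (`"human"`, `"hsapiens"`) are assumptions, since this README does not state the organism. Both wrappers need network access, so the runnable part of the example uses a mocked response instead.

```python
def symbols_from_hits(hits: list) -> dict:
    """Turn MyGene.info querymany() results into an {ensembl_id: symbol} dict."""
    # Entries without a "symbol" key (for example, IDs that were not found) are skipped.
    return {hit["query"]: hit["symbol"] for hit in hits if "symbol" in hit}


def map_ensembl_to_symbol(ensembl_ids: list, species: str = "human") -> dict:
    """Query MyGene.info for gene symbols (requires network access)."""
    import mygene  # imported here so the pure helper above has no dependency

    client = mygene.MyGeneInfo()
    hits = client.querymany(
        ensembl_ids, scopes="ensembl.gene", fields="symbol", species=species
    )
    return symbols_from_hits(hits)


def enrich_module(gene_symbols: list, organism: str = "hsapiens"):
    """Run g:Profiler enrichment for one module (requires network access)."""
    from gprofiler import GProfiler

    profiler = GProfiler(return_dataframe=True)
    # One call covers the three GO branches, KEGG, and Reactome.
    return profiler.profile(
        organism=organism,
        query=list(gene_symbols),
        sources=["GO:BP", "GO:MF", "GO:CC", "KEGG", "REAC"],
    )


# Mocked querymany() output, for illustration only (no network call is made here).
mock_hits = [
    {"query": "ENSG00000141510", "symbol": "TP53"},
    {"query": "ENSG_NOT_A_REAL_ID", "notfound": True},
]
mapping = symbols_from_hits(mock_hits)
print(mapping)
```

Keeping the response parsing in a separate pure function makes it possible to check the mapping logic without contacting the API.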

How to Run

# Install dependencies
pip install pandas networkx python-louvain seaborn matplotlib mygene gprofiler-official

# Launch the notebook
jupyter notebook gene_coexpression_network_analysis.ipynb

The notebook expects a tab-separated count matrix file named GeneExpressionData.counts in the working directory.
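Loading such a file with pandas might look like the following sketch. The layout is an assumption (gene IDs in the first column, sample names in the header row), the function name `load_counts` is hypothetical, and the in-memory example uses made-up counts so that it runs without the real file.

```python
import io

import pandas as pd


def load_counts(source) -> pd.DataFrame:
    """Read a tab-separated count matrix: genes as rows, samples as columns."""
    # Assumption: the first column holds gene IDs and the first row holds sample names.
    return pd.read_csv(source, sep="\t", index_col=0)


# In the notebook this would be: counts = load_counts("GeneExpressionData.counts")
# Here an in-memory stand-in with made-up values keeps the example runnable.
example_file = io.StringIO(
    "gene_id\tsample_1\tsample_2\tsample_3\n"
    "ENSG00000141510\t120\t98\t143\n"
    "ENSG00000012048\t15\t0\t22\n"
)
counts = load_counts(example_file)
print(counts.shape)
```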

Author

Juliana Pinto -- Ph.D. student in Genetics and Molecular Biology

bioinfo_python_course

Repository for the final project of the Python for Bioinformatics course of the Graduate Program in Genetics and Molecular Biology at the Federal University of Pará.
