Final project for the course "Python for Bioinformatics" (Instructor: Prof. Gilderlânio Santana), completed during my Ph.D.
The goal of this project was to develop a fully self-contained Jupyter Notebook that applies core Python programming and bioinformatics concepts to a real biological question: can we identify groups of co-expressed genes and link them to biological functions using network analysis?
The notebook walks through the entire workflow -- from reading a raw count matrix to annotating functional modules -- using only Python libraries. It demonstrates skills in data wrangling (pandas), graph theory and network analysis (NetworkX), community detection algorithms (Louvain), API-based gene annotation (MyGene.info), and functional enrichment analysis (gProfiler), as well as data visualization with matplotlib and seaborn.
Starting from an RNA-seq gene expression count matrix (42,114 genes x 70 samples), we build a gene co-expression network by computing pairwise Spearman correlations and applying a hard threshold. We then analyze the network topology, identify hub genes, detect co-expression modules, and annotate each module with biological functions from the GO, KEGG, and Reactome databases.
| Step | Description | Key Methods |
|---|---|---|
| 1. Data Preprocessing | Read count matrix, filter low-expression genes, z-score normalization, variance-based gene selection | pandas, custom filter_genes() function |
| 2. Co-Expression Calculation | Spearman rank correlation matrix with hard thresholding (\|r\| > 0.9) to build a binary adjacency matrix | pandas `.corr()` |
| 3. Network Construction | Build an undirected graph; compute degree, clustering coefficient, and betweenness centrality | NetworkX |
| 4. Gene ID Mapping | Convert Ensembl IDs to standard gene symbols via REST API | mygene (MyGene.info) |
| 5. Hub Gene Identification | Select the top 10% of genes by degree as hub genes | Percentile-based thresholding |
| 6. Community Detection | Detect co-expression modules using the Louvain algorithm | python-louvain |
| 7. Functional Annotation | Enrichment analysis (GO, KEGG, Reactome) for each hub module | gprofiler-official |
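
Steps 2 and 3 in the table can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the function name `build_coexpression_graph`, the toy matrix, and the choice to drop genes left without any edge are assumptions made here.

```python
import networkx as nx
import numpy as np
import pandas as pd


def build_coexpression_graph(expr: pd.DataFrame, threshold: float = 0.9) -> nx.Graph:
    """Build an undirected graph from a genes x samples expression table.

    Two genes are connected when |Spearman rho| > threshold.
    (Hypothetical helper; the notebook may organize this differently.)
    """
    # DataFrame.corr works column-wise, so transpose to correlate genes.
    corr = expr.T.corr(method="spearman")
    # Hard threshold on the absolute correlation gives a binary adjacency matrix.
    adjacency = (corr.abs() > threshold).to_numpy().astype(int)
    np.fill_diagonal(adjacency, 0)  # ignore each gene's correlation with itself
    adjacency_df = pd.DataFrame(adjacency, index=corr.index, columns=corr.columns)
    graph = nx.from_pandas_adjacency(adjacency_df)
    # Assumption: genes with no edge above the threshold are dropped.
    graph.remove_nodes_from(list(nx.isolates(graph)))
    return graph


# Toy data: g2 rises with g1, g3 falls with g1, g4 is only weakly related.
toy_expr = pd.DataFrame(
    {
        "s1": [1, 2, 6, 3],
        "s2": [2, 4, 5, 1],
        "s3": [3, 6, 4, 6],
        "s4": [4, 8, 3, 2],
        "s5": [5, 10, 2, 5],
        "s6": [6, 13, 1, 4],
    },
    index=["g1", "g2", "g3", "g4"],
)

graph = build_coexpression_graph(toy_expr, threshold=0.9)
degree = dict(graph.degree())
clustering = nx.clustering(graph)
betweenness = nx.betweenness_centrality(graph)
```

Thresholding the absolute value keeps the negatively correlated `g3` connected to `g1` and `g2`, while `g4` ends up without edges and is removed.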
- 42,114 -> 16,025 -> 1,270 genes: successive filtering by expression (>= 10 counts in 90% of samples) and variance (top 25%) reduced the gene set to a computationally tractable size while retaining the most informative genes
- 550 nodes, 3,174 edges: the co-expression network at Spearman |r| > 0.9 captured only the strongest co-expression relationships
- Scale-free-like topology: the right-skewed degree distribution and high clustering coefficients are characteristic of biological networks, which is consistent with a sound network construction approach
- 20 modules (full network), 4 modules (hub subgraph): Louvain community detection identified distinct groups of co-expressed genes at both scales
- Functional enrichment: gProfiler annotation of the 4 hub modules revealed distinct biological processes per module, linking network structure to biological function
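
The hub selection and two-scale module detection summarized above might look something like the sketch below. It assumes hubs are the genes at or above the 90th percentile of degree, and it swaps in NetworkX's built-in `louvain_communities` (available in NetworkX 2.8 and later) for the python-louvain package that the notebook uses. The helper `hub_genes` and the barbell toy graph are hypothetical.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities


def hub_genes(graph: nx.Graph, top_fraction: float = 0.10) -> list:
    """Return nodes whose degree is at or above the (1 - top_fraction) percentile.

    With tied degrees this can return more than top_fraction of the nodes.
    """
    degrees = dict(graph.degree())
    cutoff = np.percentile(list(degrees.values()), 100 * (1 - top_fraction))
    return [node for node, deg in degrees.items() if deg >= cutoff]


# Toy network: two 5-node cliques joined by a single edge between nodes 4 and 5.
toy_graph = nx.barbell_graph(5, 0)

hubs = hub_genes(toy_graph, top_fraction=0.10)
hub_subgraph = toy_graph.subgraph(hubs).copy()

# Community detection at both scales; a fixed seed makes the runs repeatable.
full_modules = louvain_communities(toy_graph, seed=42)
hub_modules = louvain_communities(hub_subgraph, seed=42)
```

Each result is a list of node sets, so module sizes can be inspected with `[len(m) for m in full_modules]`.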
- `filter_genes()`: a reusable function with configurable `min_count` and `min_prop` parameters for flexible gene filtering
- Spearman correlation with hard thresholding: chosen over Pearson because gene expression data often violates normality assumptions; the absolute-value threshold captures both positive and negative co-expression
- Two-scale analysis: community detection on both the full network (global view) and the hub subgraph (focused on the most highly connected genes) provides complementary perspectives
- Ensembl-to-symbol mapping via MyGene.info API: makes results biologically interpretable
- gProfiler enrichment: queries multiple databases (GO Biological Process, GO Molecular Function, GO Cellular Component, KEGG, Reactome) in a single call
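
As a rough illustration of the filtering design, a function with the same name and parameters as `filter_genes()` could be written as below. The body is a guess at the behavior described in this README (at least `min_count` reads in at least `min_prop` of samples), not the notebook's implementation. The variance helper `top_variance_genes` and the toy counts are hypothetical, and z-score normalization is left out because the README does not specify how it is applied.

```python
import pandas as pd


def filter_genes(counts: pd.DataFrame, min_count: int = 10, min_prop: float = 0.9) -> pd.DataFrame:
    """Keep genes (rows) with >= min_count reads in >= min_prop of samples (columns)."""
    fraction_expressed = (counts >= min_count).mean(axis=1)
    return counts.loc[fraction_expressed >= min_prop]


def top_variance_genes(expr: pd.DataFrame, top_frac: float = 0.25) -> pd.DataFrame:
    """Keep the top_frac most variable genes (hypothetical helper)."""
    n_keep = max(1, int(round(len(expr) * top_frac)))
    most_variable = expr.var(axis=1).nlargest(n_keep).index
    return expr.loc[most_variable]


# Toy count matrix: 4 genes x 4 samples.
toy_counts = pd.DataFrame(
    {
        "s1": [50, 0, 12, 30],
        "s2": [60, 3, 15, 31],
        "s3": [55, 20, 9, 29],
        "s4": [70, 1, 14, 30],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

# geneB passes the count cutoff in only one of four samples, so it is dropped.
kept = filter_genes(toy_counts, min_count=10, min_prop=0.75)
# Of the three remaining genes, geneA varies the most across samples.
top = top_variance_genes(kept, top_frac=0.34)
```

Exposing both cutoffs as parameters makes it easy to check how sensitive the final gene set is to the filtering stringency.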
| Library | Purpose |
|---|---|
| pandas | Data manipulation, filtering, and normalization |
| NetworkX | Graph construction, network metrics (degree, clustering, betweenness) |
| python-louvain | Community detection (Louvain algorithm for modularity optimization) |
| seaborn / matplotlib | Visualization (histograms, scatter plots, network graphs) |
| mygene | Gene ID mapping (Ensembl -> gene symbol) via MyGene.info REST API |
| gprofiler-official | Functional enrichment analysis (GO, KEGG, Reactome) |
# Install dependencies
pip install pandas networkx python-louvain seaborn matplotlib mygene gprofiler-official
# Launch the notebook
jupyter notebook gene_coexpression_network_analysis.ipynb

The notebook expects a tab-separated count matrix file named `GeneExpressionData.counts` in the working directory.
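
For reference, a tab-separated count matrix of this kind can be read with pandas roughly as shown below. The layout is an assumption: gene IDs in the first column and sample names in the header row. The helper `load_counts` and the in-memory toy file are hypothetical stand-ins for the real input.

```python
import io

import pandas as pd


def load_counts(source) -> pd.DataFrame:
    """Read a tab-separated count matrix, using the first column as gene IDs."""
    return pd.read_csv(source, sep="\t", index_col=0)


# Toy stand-in for the real file; in practice: load_counts("GeneExpressionData.counts")
toy_file = io.StringIO(
    "gene_id\tsample_1\tsample_2\tsample_3\n"
    "gene_1\t120\t98\t143\n"
    "gene_2\t0\t4\t1\n"
)
counts = load_counts(toy_file)
```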
Repository for the final project of the Python for Bioinformatics course of the Graduate Program in Genetics and Molecular Biology at the Federal University of Pará.