A Python-based bioinformatics project for analyzing protein sequences and extracting important biochemical, structural, and physicochemical properties.
The project reads protein sequences in FASTA format, performs protein analysis using Biopython, computes hydropathy profiles, detects motifs, classifies proteins based on stability and hydrophobicity, and visualizes the results using Matplotlib.
Proteins are biological macromolecules made up of amino acid chains, and their properties determine how they behave in a cell.
This project focuses on sequence-based protein analysis and produces a detailed report for each protein sequence provided in the input.
The analyzer calculates:
- Protein Length
- Molecular Weight
- Isoelectric Point (pI)
- GRAVY Score (Grand Average of Hydropathy)
- Instability Index
- Secondary Structure Fractions
- Hydropathy Profile
- Motif Locations
- Protein Classification

🧾 FASTA Parsing
- Reads protein sequences in FASTA format
- Separates sequence names (header lines) from sequence data

📂 External FASTA File Support
- Can be extended to load `.fasta`, `.fa`, or `.txt` FASTA files from:
  - local system
  - project folder
  - downloads folder
  - any custom file path

🧪 Protein Property Analysis
- Uses `Bio.SeqUtils.ProtParam.ProteinAnalysis`
- Computes:
  - molecular weight
  - isoelectric point
  - GRAVY
  - instability index
  - helix fraction
  - turn fraction
  - sheet fraction
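
The property calculations listed above could be collected along the following lines. This is a minimal sketch, not the project's actual code: the function name `analyze_protein`, the dictionary keys, and the sample peptide are illustrative assumptions. It also assumes the sequence contains only the 20 standard amino acid letters, since some Biopython calculations fail on other symbols.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis


def analyze_protein(sequence):
    """Return basic sequence-derived properties as a dict (hypothetical helper)."""
    analysis = ProteinAnalysis(sequence)
    # Biopython returns the fractions in the order (helix, turn, sheet).
    helix, turn, sheet = analysis.secondary_structure_fraction()
    return {
        "length": len(sequence),
        "molecular_weight": analysis.molecular_weight(),
        "isoelectric_point": analysis.isoelectric_point(),
        "gravy": analysis.gravy(),
        "instability_index": analysis.instability_index(),
        "helix_fraction": helix,
        "turn_fraction": turn,
        "sheet_fraction": sheet,
    }


# Made-up peptide used only to demonstrate the call.
sample = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
for key, value in analyze_protein(sample).items():
    print(f"{key}: {value}")
```

Returning a plain dictionary keeps each protein's results easy to pass on to a table or a classifier.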

🌊 Hydropathy Plot
- Uses a sliding-window method to calculate local hydropathy values
- Helps identify hydrophobic and hydrophilic regions in the protein
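
As a rough illustration of the sliding-window idea, the sketch below averages per-residue hydropathy values over each window. It assumes the Kyte-Doolittle scale (the scale that GRAVY is based on) and a window of 9 residues; the scale and window size the project actually uses are not stated here, and the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    # One value per window start, so the profile has len(sequence) - window + 1 points.
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


# Made-up peptide used only to demonstrate the call.
profile = hydropathy_profile("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", window=9)
print(len(profile), "window averages")
```

Positive averages indicate hydrophobic stretches and negative averages indicate hydrophilic ones. Biopython's `ProteinAnalysis.protein_scale` offers a comparable windowed calculation if a library routine is preferred.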

🔍 Motif Detection
- Uses regular expressions to identify sequence motifs
- Default motif used in the project: `N[^P][ST]`
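
A small sketch of a regex-based motif scan with the default pattern, which matches an asparagine, then any residue except proline, then serine or threonine (the pattern commonly used for N-linked glycosylation sites). Two choices here are assumptions rather than the project's confirmed behavior: positions are reported 1-based, and a lookahead is used so that overlapping matches are also found. The function name `find_motifs` is hypothetical.

```python
import re


def find_motifs(sequence, pattern=r"N[^P][ST]"):
    """Return (1-based start position, matched text) for every motif occurrence."""
    # Wrapping the pattern in a zero-width lookahead lets overlapping hits be reported.
    regex = re.compile(f"(?=({pattern}))")
    return [(match.start() + 1, match.group(1)) for match in regex.finditer(sequence)]


# Made-up sequence used only to demonstrate the call.
print(find_motifs("MNGSANPTNNTS"))
# → [(2, 'NGS'), (9, 'NNT'), (10, 'NTS')]
```

The `NPT` at position 6 is skipped because proline is excluded in the middle position. A plain `re.finditer(pattern, sequence)` would report `NNT` but miss the overlapping `NTS`.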

🧠 Protein Classification
- Classifies proteins as:
  - `Stable & Hydrophobic`
  - `Stable & Hydrophilic`
  - `Unstable Protein`
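
One way the three labels could be assigned is sketched below. The cutoffs are assumptions, since the project's exact thresholds are not stated here: an instability index of 40 or more is treated as unstable (the conventional reading of that index), and a GRAVY score above 0 is treated as hydrophobic. The function name `classify_protein` is hypothetical.

```python
def classify_protein(instability_index, gravy, stability_cutoff=40.0, gravy_cutoff=0.0):
    """Map instability index and GRAVY score to one of the three labels."""
    # Assumed rule: stability is checked first, so unstable proteins
    # are not split further by hydrophobicity.
    if instability_index >= stability_cutoff:
        return "Unstable Protein"
    if gravy > gravy_cutoff:
        return "Stable & Hydrophobic"
    return "Stable & Hydrophilic"


# Placeholder inputs, not measurements from a real protein.
print(classify_protein(instability_index=35.2, gravy=0.41))
# → Stable & Hydrophobic
```

Keeping the cutoffs as keyword arguments makes it easy to adjust them without touching the decision logic.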

📊 Tabular Summary
- Displays all computed values in a Pandas DataFrame
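
A minimal sketch of building such a summary table, assuming the per-protein results are held in a dictionary keyed by sequence name. The function name `build_summary`, the column names, and all numbers below are placeholders for illustration, not output from the project.

```python
import pandas as pd


def build_summary(results):
    """Turn {protein_name: {property: value}} into a DataFrame, one row per protein."""
    table = pd.DataFrame.from_dict(results, orient="index")
    table = table.rename_axis("protein")
    # Round numeric columns for display; text columns are left unchanged.
    return table.round(3)


# Placeholder values only, chosen to show the table layout.
example_results = {
    "Protein_A": {"length": 120, "gravy": -0.31415, "classification": "Stable & Hydrophilic"},
    "Protein_B": {"length": 86, "gravy": 0.27182, "classification": "Stable & Hydrophobic"},
}
print(build_summary(example_results))
```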

📈 Graphical Visualization
- Generates hydropathy plots for each protein sequence
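
A hedged sketch of a per-protein hydropathy plot with Matplotlib. The function name `plot_hydropathy`, the axis labels, and the figure size are illustrative assumptions. The non-interactive `Agg` backend is selected so the sketch also runs on machines without a display; drop that line to get interactive windows.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless environment, figures are saved to files
import matplotlib.pyplot as plt


def plot_hydropathy(profile, name, window):
    """Plot windowed hydropathy values against the window start position."""
    positions = range(1, len(profile) + 1)
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(positions, profile)
    # The zero line separates hydrophobic (above) from hydrophilic (below) windows.
    ax.axhline(0, linestyle="--", linewidth=0.8)
    ax.set_xlabel("Window start position")
    ax.set_ylabel("Mean hydropathy")
    ax.set_title(f"Hydropathy profile: {name} (window = {window})")
    fig.tight_layout()
    return fig
```

The returned figure can be written to disk with `fig.savefig("Protein_A_hydropathy.png")`, where the file name is again only an example.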
Technologies used:
- Python
- Biopython
- Matplotlib
- Pandas
- Regular Expressions (`re`)
The program starts from protein sequences written in FASTA format, where each record consists of a header line beginning with `>` followed by the sequence:
>Protein_Name
SEQUENCE
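
To make the format concrete, here is a minimal hand-rolled parser for FASTA text held in a string. It is a sketch under simple assumptions (the whole header line after `>` is used as the name, and sequences may wrap over several lines), and the function name `parse_fasta` is hypothetical. For real datasets, Biopython's `Bio.SeqIO.parse(handle, "fasta")` is the usual choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: everything after '>' becomes the record name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: may be one of several for the current record.
            records[name].append(line)
    return {key: "".join(parts).upper() for key, parts in records.items()}


# Made-up records used only to demonstrate the call.
fasta_text = """>Protein_A
MKTAYIAKQR
QISFVKSHFS
>Protein_B
GSHMLEDPVD
"""
print(parse_fasta(fasta_text))
```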
The project is not limited to hardcoded FASTA sequences.
Users can also import their own FASTA files (`.fasta`, `.fa`, `.txt`) from anywhere on their local system.
This makes the project more flexible and closer to real-world bioinformatics workflows, where protein datasets are usually stored as external FASTA files.
Example file paths:
C:\Users\yourname\Desktop\protein.fasta
D:\Bioinformatics\sample.fa
./datasets/proteins.fasta
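
Loading from paths like these could look roughly as follows. This is a sketch that assumes Biopython's `SeqIO` is used for reading; the function name `load_fasta_file` and the extension check are illustrative assumptions rather than the project's confirmed behavior.

```python
from pathlib import Path

from Bio import SeqIO

ALLOWED_SUFFIXES = {".fasta", ".fa", ".txt"}


def load_fasta_file(path):
    """Read a FASTA file and return {record_id: sequence}."""
    path = Path(path).expanduser()
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported file extension: {path.suffix!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    with path.open() as handle:
        # record.id is the first whitespace-separated token of the header line.
        return {record.id: str(record.seq) for record in SeqIO.parse(handle, "fasta")}
```

A call such as `load_fasta_file("./datasets/proteins.fasta")` would then return the sequences keyed by record ID. `pathlib` accepts both Windows-style and relative paths, so the same function covers all three examples on the matching operating system.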