This project implements a k-Nearest Neighbors (k-NN) classifier in R to predict gene families from DNA sequences, using hexamer (6-mer) counts as features.
The work combines bioinformatics concepts (preprocessing and sequence representation) with biostatistics (modeling, validation, and performance evaluation).
gene-classification-knn/
├── README.md # Project description and usage
├── data/ # Input data (raw files)
├── notebooks/ # RMarkdown notebook with the complete analysis
├── scripts/ # Helper R functions
│ ├── libraries.R # Loads required packages
│ ├── hexamer_functions.R # Functions to generate and count hexamers
│ └── roc_functions.R # Functions to compute and plot ROC curves
└── results/ # Generated plots and performance metrics

- Transform DNA sequences into numerical vectors using hexamer counts.
- Implement the k-NN algorithm in R for supervised classification.
- Evaluate performance with different values of k (1, 3, 5, 7).
- Compare results using accuracy, error rate, kappa, and ROC curves.
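The hexamer-count transformation in the first objective can be sketched in base R as follows. This is a minimal illustration, not the code in `scripts/hexamer_functions.R`: the names `all_hexamers` and `count_hexamers` are hypothetical, and the sketch assumes overlapping windows and ignores any window containing a character other than A, C, G, or T.

```r
# All 4^6 = 4096 possible hexamers in a fixed (alphabetical) order,
# so every sequence is mapped onto the same feature columns.
bases <- c("A", "C", "G", "T")
grid <- expand.grid(rep(list(bases), 6), stringsAsFactors = FALSE)
all_hexamers <- do.call(paste0, unname(as.list(grid[6:1])))

# Hypothetical helper: count overlapping hexamers in one DNA sequence.
count_hexamers <- function(sequence) {
  sequence <- toupper(sequence)
  n <- nchar(sequence)
  if (n < 6) {
    # Too short to contain a hexamer: return an all-zero vector.
    return(setNames(integer(length(all_hexamers)), all_hexamers))
  }
  starts <- seq_len(n - 5)
  windows <- substring(sequence, starts, starts + 5)
  # Windows with ambiguous bases (e.g. N) become NA and are dropped by table().
  counts <- table(factor(windows, levels = all_hexamers))
  setNames(as.integer(counts), all_hexamers)
}

# One row per sequence, one column per hexamer.
sequences <- c("ACGTACGTAC", "TTTTTTTT")
features <- t(vapply(sequences, count_hexamers,
                     integer(length(all_hexamers)), USE.NAMES = FALSE))
dim(features)
```

A fixed column order matters here because k-NN compares sequences by the distance between their count vectors, so every vector must index the same hexamers.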
This project was developed in R.
Required libraries:
- here
- ggplot2
- class
- gmodels
- vcd
- ROCR
They can be installed with:
install.packages(c("here", "ggplot2", "class", "gmodels", "vcd", "ROCR"))

- Clone the main repository
git clone https://github.com/albaaggbb/gene-classification-knn.git
cd gene-classification-knn
- Open the file in RStudio
notebooks/gene_classification_knn.Rmd
- Run the analysis to reproduce the results. Heavy computations (hexamer counts, classification results) are cached for efficiency and reproducibility.
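One common way to cache heavy computations in R is to save the result to an `.rds` file and reload it on later runs. The sketch below illustrates that pattern only; whether the notebook uses this approach, knitr chunk caching, or something else is not stated here. The helper `cache_rds` and the file name are hypothetical.

```r
# Hypothetical helper: compute a value once, then reuse the saved copy.
# `compute` is a zero-argument function that produces the value.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))            # cache hit: skip the computation
  }
  value <- compute()                 # cache miss: do the work
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(value, path)
  value
}

# Illustrative usage with a placeholder computation and a temporary file.
cache_file <- file.path(tempdir(), "hexamer_counts_example.rds")
counts <- cache_rds(cache_file, function() matrix(1:6, nrow = 2))
```

Deleting the cached file forces the computation to run again, which is the usual way to refresh results after the input data change.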
- Sequence length distribution: most sequences are under 500 bp.
- k-NN performance: comparison across k = 1, 3, 5, 7.
- ROC curves: illustrative analysis for selected gene families.
- Metrics table: accuracy, error, and kappa for each value of k.
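The comparison across k and the metrics table can be sketched as below. This is an assumption-laden illustration rather than the notebook's code: it uses `knn()` from the `class` package, computes Cohen's kappa by hand instead of through `gmodels` or `vcd`, and runs on the built-in `iris` data as a stand-in for the hexamer feature matrix. The helpers `cohen_kappa` and `evaluate_knn` are hypothetical names.

```r
library(class)  # provides knn()

# Cohen's kappa from a square confusion matrix:
# agreement beyond what the marginal frequencies would give by chance.
cohen_kappa <- function(confusion) {
  n <- sum(confusion)
  p_observed <- sum(diag(confusion)) / n
  p_expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical helper: fit k-NN for each k and collect the metrics.
evaluate_knn <- function(train_x, test_x, train_y, test_y,
                         ks = c(1, 3, 5, 7)) {
  rows <- lapply(ks, function(k) {
    predicted <- knn(train = train_x, test = test_x, cl = train_y, k = k)
    # Force identical levels so the confusion matrix is square.
    confusion <- table(factor(predicted, levels = levels(test_y)), test_y)
    accuracy <- sum(diag(confusion)) / sum(confusion)
    data.frame(k = k, accuracy = accuracy, error = 1 - accuracy,
               kappa = cohen_kappa(confusion))
  })
  do.call(rbind, rows)
}

# Stand-in data only: iris, NOT the gene family dataset.
set.seed(1)  # knn() breaks ties at random
train_idx <- sample(nrow(iris), 100)
metrics <- evaluate_knn(iris[train_idx, 1:4], iris[-train_idx, 1:4],
                        iris$Species[train_idx], iris$Species[-train_idx])
print(metrics)
```

With the real data, `train_x` and `test_x` would be the hexamer count matrices and `train_y` and `test_y` factors of gene family labels.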
- k = 1 achieved the best accuracy on this dataset, though it may overfit.
- Increasing k reduced accuracy and agreement (kappa).
- Hexamer representation proved effective for supervised classification of DNA sequences.
- The workflow highlights the potential of simple machine learning algorithms in bioinformatics applications.
Project developed by Alba Górriz.