Gene Family Classification with k-NN

This project implements a k-Nearest Neighbors (k-NN) classifier in R to predict gene families from DNA sequences, using hexamer (6-mer) counts as features.

The work combines bioinformatics concepts (preprocessing and sequence representation) with biostatistics (modeling, validation, and performance evaluation).

Repository Structure

gene-classification-knn/
├── README.md           # Project description and usage
├── data/               # Input data (raw files)
├── notebooks/          # RMarkdown notebook with the complete analysis
├── scripts/            # Helper R functions
│   ├── libraries.R         # Loads required packages
│   ├── hexamer_functions.R # Functions to generate and count hexamers
│   └── roc_functions.R     # Functions to compute and plot ROC curves
└── results/            # Generated plots and performance metrics

Objectives

  • Transform DNA sequences into numerical vectors using hexamer counts.
  • Implement the k-NN algorithm in R for supervised classification.
  • Evaluate performance with different values of k (1, 3, 5, 7).
  • Compare results using accuracy, error rate, kappa, and ROC curves.
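The first objective, turning sequences into numerical vectors, can be sketched in base R as follows. This is only an illustration under stated assumptions: the names `hexamer_vocab` and `count_hexamers` are hypothetical, the example sequences are made up, and the helpers in `scripts/hexamer_functions.R` may work differently. The sketch counts overlapping hexamers and ignores any window containing a character other than A, C, G, or T.

```r
# Hypothetical helpers; the repository's hexamer_functions.R may differ.

# Build the fixed vocabulary of all 4^k possible k-mers (4096 for k = 6),
# so that every sequence maps to a vector with the same column order.
hexamer_vocab <- function(k = 6) {
  bases <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  sort(do.call(paste0, grid))
}

# Count overlapping k-mers in a single DNA string.
count_hexamers <- function(dna, vocab, k = 6) {
  dna <- toupper(dna)
  n <- nchar(dna)
  counts <- setNames(integer(length(vocab)), vocab)
  if (n < k) return(counts)                  # too short: all zeros
  starts <- seq_len(n - k + 1)
  kmers <- substring(dna, starts, starts + k - 1)
  tab <- table(kmers[kmers %in% vocab])      # drop windows with N, gaps, etc.
  counts[names(tab)] <- as.integer(tab)
  counts
}

# Made-up example sequences, one row of counts per sequence
vocab <- hexamer_vocab()
seqs <- c("ATGCGTACGTTAGC", "ATGCGTATGCGT")
X <- t(vapply(seqs, count_hexamers, integer(length(vocab)),
              vocab = vocab, USE.NAMES = FALSE))
colnames(X) <- vocab
dim(X)
```

Each row of `X` is then a feature vector that a distance-based classifier such as k-NN can consume directly.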

Requirements

This project was developed in R.

Required libraries:

  • here
  • ggplot2
  • class
  • gmodels
  • vcd
  • ROCR

They can be installed with:

install.packages(c("here", "ggplot2", "class", "gmodels", "vcd", "ROCR"))

Execution

  1. Clone the repository:
git clone https://github.com/albaaggbb/gene-classification-knn.git
cd gene-classification-knn
  2. Open the notebook in RStudio:
notebooks/gene_classification_knn.Rmd
  3. Run the analysis to reproduce the results. Heavy computations (hexamer counts, classification results) are cached for efficiency and reproducibility.
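The caching mentioned in the last step could look roughly like the sketch below. It is an assumption about the general pattern, not a description of the notebook: the helper name `cached` and the file name are hypothetical, and the notebook may rely on a different mechanism, such as knitr chunk caching.

```r
# Hypothetical helper: compute a result once, save it, and reuse it afterwards.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))                    # reuse the stored result
  }
  result <- compute()                        # expensive step runs only once
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Usage with a stand-in computation; a real call would build the hexamer matrix
cache_file <- file.path(tempdir(), "hexamer_counts.rds")
counts <- cached(cache_file, function() {
  matrix(1:6, nrow = 2)
})
```

Deleting the cached file forces the computation to run again, which is useful when the input data change.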

Main results

  • Sequence length distribution: most sequences are under 500 bp.
  • k-NN performance: comparison across k = 1, 3, 5, 7.
  • ROC curves: illustrative analysis for selected gene families.
  • Metrics table: accuracy, error, and kappa for each value of k.
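A metrics table of this kind can be produced along the following lines. The sketch runs on simulated count data, not the project's dataset, so its numbers say nothing about the real results. It uses `knn()` from the `class` package; the `cohen_kappa` helper is hypothetical and computes kappa by hand from the confusion matrix, whereas the project lists `gmodels` and `vcd` and presumably uses those for cross-tabulation and agreement.

```r
library(class)  # provides knn()

set.seed(42)

# Simulated stand-in for a hexamer count matrix: 3 families, 20 features
families <- factor(rep(c("famA", "famB", "famC"), each = 40))
rates <- rbind(famA = rep(c(2, 5), 10),
               famB = rep(c(5, 2), 10),
               famC = rep(4, 20))
X <- matrix(rpois(length(families) * 20,
                  lambda = rates[as.character(families), ]),
            nrow = length(families))

# Simple 70/30 train/test split
train_idx <- sample(nrow(X), size = 0.7 * nrow(X))

# Cohen's kappa: observed agreement corrected for chance agreement
cohen_kappa <- function(tab) {
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_obs - p_exp) / (1 - p_exp)
}

# One row of metrics per value of k
results <- do.call(rbind, lapply(c(1, 3, 5, 7), function(k) {
  pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
              cl = families[train_idx], k = k)
  tab <- table(actual = families[-train_idx], predicted = pred)
  acc <- sum(diag(tab)) / sum(tab)
  data.frame(k = k, accuracy = acc, error = 1 - acc,
             kappa = cohen_kappa(tab))
}))
print(results)
```

Odd values of k are a common choice because they reduce the chance of tied votes among neighbors.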

Conclusions

  • k = 1 achieved the best accuracy on this dataset, though it may overfit.
  • Increasing k reduced accuracy and agreement (kappa).
  • Hexamer representation proved effective for supervised classification of DNA sequences.
  • The workflow highlights the potential of simple machine learning algorithms in bioinformatics applications.
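One way to probe the overfitting concern for k = 1 is leave-one-out cross-validation, which `class::knn.cv()` provides. The sketch below is a suggestion, not part of the original analysis, and it uses R's built-in `iris` data purely as a stand-in for a hexamer count matrix.

```r
library(class)  # provides knn.cv()

set.seed(1)

# Stand-in data: scaled numeric features and a class label per row
X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out accuracy for each k: every row is classified using all others
ks <- c(1, 3, 5, 7)
loo_accuracy <- setNames(
  sapply(ks, function(k) mean(knn.cv(train = X, cl = y, k = k) == y)),
  paste0("k=", ks)
)
print(loo_accuracy)
```

If k = 1 still leads under cross-validation on the real data, that would support the conclusion above; a drop relative to the single train/test split would point toward overfitting.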

Author

Project developed by Alba Górriz.
