Gene Family Classification with k-NN

This project implements a k-Nearest Neighbors (k-NN) classifier in R to predict gene families from DNA sequences, using hexamer (6-mer) counts as features.

The work combines bioinformatics concepts (preprocessing and sequence representation) with biostatistics (modeling, validation, and performance evaluation).

Repository Structure

gene-classification-knn/
├── README.md           # Project description and usage
├── data/               # Input data (raw files)
├── notebooks/          # RMarkdown notebook with the complete analysis
├── scripts/            # Helper R functions
│   ├── libraries.R         # Loads required packages
│   ├── hexamer_functions.R # Functions to generate and count hexamers
│   └── roc_functions.R     # Functions to compute and plot ROC curves
└── results/            # Generated plots and performance metrics

Objectives

Transform DNA sequences into numerical vectors using hexamer counts.
Implement the k-NN algorithm in R for supervised classification.
Evaluate performance with different values of k (1, 3, 5, 7).
Compare results using accuracy, error rate, kappa, and ROC curves.
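
As a rough illustration of the first objective, the sketch below counts overlapping hexamers in base R. The helper name count_hexamers and the toy sequences are assumptions made for this example. The repository's own functions live in scripts/hexamer_functions.R and may differ, for instance in how ambiguous bases are handled.

```r
# Hypothetical helper (not the repository's function): count overlapping
# k-mers in one DNA sequence and return a fixed-length integer vector.
count_hexamers <- function(dna, k = 6) {
  dna <- toupper(dna)
  n <- nchar(dna)

  # Enumerate all 4^k possible k-mers in a fixed (sorted) order so that
  # vectors from different sequences are directly comparable.
  bases <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  all_kmers <- sort(do.call(paste0, unname(as.list(grid))))

  # Sequences shorter than k contain no k-mers at all.
  if (n < k) {
    return(setNames(integer(length(all_kmers)), all_kmers))
  }

  # Slide a window of width k along the sequence, one base at a time.
  starts <- seq_len(n - k + 1)
  kmers <- substring(dna, starts, starts + k - 1)

  # Windows containing characters other than A/C/G/T (e.g. N) become NA
  # in the factor and are silently dropped by table().
  counts <- table(factor(kmers, levels = all_kmers))
  setNames(as.integer(counts), all_kmers)
}

# Toy sequences, one row per sequence and one column per hexamer.
seqs <- c("ACGTACGTAC", "TTTTTTTT")
X <- t(vapply(seqs, count_hexamers, integer(4096)))
dim(X)
```

Each sequence becomes one row of 4^6 = 4096 counts, which is the kind of numeric matrix a distance-based classifier such as k-NN can work with.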

Requirements

This project was developed in R.

Required libraries:

here
ggplot2
class
gmodels
vcd
ROCR

They can be installed with:

install.packages(c("here", "ggplot2", "class", "gmodels", "vcd", "ROCR"))

Execution

Clone the main repository

git clone https://github.com/albaaggbb/gene-classification-knn.git
cd gene-classification-knn

Open the file in RStudio

notebooks/gene_classification_knn.Rmd

Run the analysis to reproduce the results. Heavy computations (hexamer counts, classification results) are cached for efficiency and reproducibility.
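
The README does not say how the caching is implemented. One common pattern in R is to store a result with saveRDS() and reload it with readRDS() on later runs. The sketch below shows that pattern under stated assumptions: the helper cache_rds and the file name are hypothetical, and the notebook may use a different mechanism.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise compute it, save it, and return it.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Illustrative use: the function body stands in for an expensive step.
cache_file <- file.path(tempdir(), "hexamer_counts.rds")
counts <- cache_rds(cache_file, function() {
  c(ACGTAC = 2L, CGTACG = 1L)
})
```

Deleting the cached file forces the computation to run again on the next call.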

Main results

Sequence length distribution: most sequences are under 500 bp.
k-NN performance: comparison across k = 1, 3, 5, 7.
ROC curves: illustrative analysis for selected gene families.
Metrics table: accuracy, error, and kappa for each value of k.
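
The metrics in the table can be computed along the following lines. This is a minimal sketch on synthetic data, not the notebook's code: the toy feature matrix, the 70/30 split, and the helper cohen_kappa are assumptions made for illustration. The knn() function comes from the class package listed in the requirements.

```r
library(class)  # provides knn()

set.seed(42)

# Synthetic stand-in for a hexamer count matrix: two classes, five features.
n <- 100
X <- rbind(matrix(rnorm(n * 5, mean = 0), ncol = 5),
           matrix(rnorm(n * 5, mean = 2), ncol = 5))
y <- factor(rep(c("familyA", "familyB"), each = n))

# Assumed 70/30 train/test split.
train_idx <- sample(nrow(X), size = round(0.7 * nrow(X)))

# Hypothetical helper: Cohen's kappa from a square confusion table.
cohen_kappa <- function(truth, pred) {
  tab <- table(truth, pred)
  total <- sum(tab)
  p_observed <- sum(diag(tab)) / total
  p_expected <- sum(rowSums(tab) * colSums(tab)) / total^2
  (p_observed - p_expected) / (1 - p_expected)
}

# Fit and evaluate k-NN for each k, collecting one row of metrics per k.
metrics <- do.call(rbind, lapply(c(1, 3, 5, 7), function(k) {
  pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
              cl = y[train_idx], k = k)
  truth <- y[-train_idx]
  acc <- mean(pred == truth)
  data.frame(k = k, accuracy = acc, error = 1 - acc,
             kappa = cohen_kappa(truth, pred))
}))

print(metrics)
```

Kappa corrects accuracy for the agreement expected by chance, which is why it is reported alongside accuracy and error.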

Conclusions

k = 1 achieved the best accuracy on this dataset, though it may overfit.
Increasing k reduced accuracy and agreement (kappa).
Hexamer representation proved effective for supervised classification of DNA sequences.
The workflow highlights the potential of simple machine learning algorithms in bioinformatics applications.

Author

Project developed by Alba Górriz.
