The goal of this script is to reduce the redundancy in a set of sequences.
- A FASTA file of sequences. FASTA entry descriptions are limited to less than 128 characters.
- Distance threshold.
Iterate through each transcript in the dictionary and:
- Generate a distance matrix from the sequences in the input file using Clustal Omega.
- Parse distance matrix to dictionary.
- For each FASTA entry, identify as a cluster those other sequences with a distance less than given threshold.
- For each sequence in each cluster, generate a total combined distance score to the other members in the cluster.
- Keep the entry with the lowest distance to the other members of the cluster as a representative sequence. Ignore the other members of the cluster.
- Keep all other entries which have not included in any cluster.
- A FASTA file of all cluster representative sequences and non-clustered sequences.
- Currently only tested on Linux
- Python 2.7
- Biopython (http://biopython.org/wiki/Download#Installation_Instructions)
- clustalo
sudo apt-get install clustalo
- Edit the input_seqs_filename variable of the 'Initiating Variables' section to point to the input sequences.
- 12,610 related protein sequences (retrieved by iterative BLAST search) reduced to 2,048 sequences, using a distance threshold of 0.10, in 20 minutes on an i7 processor.