diff --git a/BINF2025_TP3.ipynb b/BINF2025_TP3.ipynb
index 61e87c2..5faf1e9 100644
--- a/BINF2025_TP3.ipynb
+++ b/BINF2025_TP3.ipynb
@@ -1,481 +1,723 @@
{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": [],
- "authorship_tag": "ABX9TyNSXnqaXAUgZK9rmJ1TWbGo"
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "V09wQ1WIOmgn"
+ },
+ "source": [
+ "# BINF TP3 - Algorithmes d'alignement par paire"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "er6CtAyOxC6F"
+ },
+ "source": [
+ "Dans ce TP nous allons manipuler les algorithmes d'alignement par paire."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BqEa3BJ1xICM"
+ },
+ "source": [
+ "# Exercice 0 - Echauffement"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qqiiq5bcxYvM"
+ },
+ "source": [
+ "Q1. Donnez le score de la superposition :\n",
+ "\n",
+ "| | |\n",
+ "| :---: | :---: |\n",
+ "x | ATGTCATGA---TAC |\n",
+ "y | AT--CTAAATGTTAC |\n",
+ "\n",
+ "\n",
+ "étant donne le schéma d'évaluation :\n",
+ "\n",
+ "| | A | T | G | C |\n",
+ "| :---: | :---: | :---: | :---: | :---: |\n",
+ "| **A** | 1 | -1 | -1 | -1 |\n",
+ "| **T** | -1 | 1 | -1 | -1 |\n",
+ "| **G** | -1 | -1 | 1 | -1 |\n",
+ "| **C** | -1 | -1 | -1 | 1 |\n",
+ "\n",
+ "et\n",
+ "\n",
+ "$\\gamma(g) = 0.5 |g| + 0.5$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kCJGGGYQ2GNi"
+ },
+ "source": [
+ "```markdown\n",
+ "On compare caractère par caractère les 2 séquences et à partir de cela on applique la matrice de substitution. Ainsi lorsqu'il y a un match identique on va ajouter 1 et pour un mismatch on va retirer un et pour les gaps on va calculer la longueur du gap et appliquer la fonction des trous. Un gap de longueur 2 va donner gamma(2) = 0.5 * 2 + 0.5 = 1.5 et un gap de longueur 3 va donner gamma(3) = 0.5 * 3 + 0.5 = 2.\n",
+ "Ce qui va nous donner : 1 + 1 - 1.5 + 1 - 1 - 1 - 1 + 1 - 2 + 1 + 1 + 1 = 0.5\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XyhXAhK-2NKJ"
+ },
+ "source": [
+ "Q2. Alignez les séquences suivantes avec l'algorithme de Levenshtein : x = ATG et y = ACTG."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "b9iovhyZ2bXw"
+ },
+ "source": [
+ "```markdown\n",
+ "Si d'abord on calcule la distance de Levenshtein sur les 2 séquences suivantes on aura 1 car on a juste à ajouter un C au premier caractère du premier mot.\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OV_YaQHr2elB"
+ },
+ "source": [
+ "Q3.\tAlignez les séquences suivantes avec l'algorithme de Needleman-Wunsch global x = TAT et y = ATGAC en considérant le schéma d'évaluation suivant\n",
+ "\n",
+ "| | A | T | G | C |\n",
+ "| :---: | :---: | :---: | :---: | :---: |\n",
+ "| **A** | 1 | -0.5 | -0.5 | -0.5 |\n",
+ "| **T** | -0.5 | 1 | -0.5 | -0.5 |\n",
+ "| **G** | -0.5 | -0.5 | 1 | -0.5 |\n",
+ "| **C** | -0.5 | -0.5 | -0.5 | 1 |\n",
+ "\n",
+ "et\n",
+ "\n",
+ "$\\gamma(g) = 0.5 |g|$\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "g_MrecVs3Nrw"
+ },
+ "source": [
+ "```markdown\n",
+ "On rempli tout d'abord la matrice en utilisant le critère de remplissage puis lorsque la matrice est totalement remplie on remonte la matrice de backtrack. Ainsi on trouve x = T-AT et y = TGAC.\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y1YF-G6E3Qoo"
+ },
+ "source": [
+ "Q4. Alignez les séquences suivantes avec l'algorithme de Smith-Waterman x = TTGG y = ATGAC en utilisant le schéma d'évaluation de la question précédente.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LLMECocb3pgI"
+ },
+ "source": [
+ "```markdown\n",
+ "On rempli la matrice de en utilisant le critère de remplissage avec en parallèle la matrice de backtrack et ensuite on remonte la matrice de backtrack. Ainsi on trouve un score optimal de 2.5.\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "46gw0avh3wGw"
+ },
+ "source": [
+ "# Exercice 1 : Algorithme de Levenshtein - version récursive"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZKc09Kyg4a6v"
+ },
+ "source": [
+ "Q1. Ecrivez une fonction\n",
+ "\n",
+ "levenshtein(x: str, y: str) -> int\n",
+ "\n",
+ "qui retourne la distance de Levenshtein entre les séquences x et y en utilisant la version récursive de l'algorithme."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def levenshtein_rec(x, y, m, n):\n",
+ " if (m == 0):\n",
+ " return n\n",
+ " if n == 0:\n",
+ " return m\n",
+ " \n",
+ " if (x[m-1] == y[n-1]):\n",
+ " return levenshtein_rec(x, y, m-1,n-1)\n",
+ " return 1 + min(\n",
+ " levenshtein_rec(x, y, m, n-1),\n",
+ " min(\n",
+ " levenshtein_rec(x, y, m-1, n),\n",
+ " levenshtein_rec(x, y, m-1,n-1)\n",
+ " )\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "id": "FJR69IEQ4aHv"
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "UP = 1\n",
+ "LEFT = 2\n",
+ "DIAG = 3\n",
+ "\n",
+ "def levenshtein_iter(x, y):\n",
+ " m, n = len(x), len(y)\n",
+ " \n",
+ " # Matrices\n",
+ " S = np.zeros((m+1, n+1)) # Matrice des scores\n",
+ " B = np.zeros((m+1, n+1), dtype=int) # Matrice des directions\n",
+ " \n",
+ " # Initialisation des bords\n",
+ " for i in range(1, m+1):\n",
+ " S[i, 0] = i\n",
+ " B[i, 0] = UP\n",
+ " for j in range(1, n+1):\n",
+ " S[0, j] = j\n",
+ " B[0, j] = LEFT\n",
+ " \n",
+ " # Remplissage des matrices\n",
+ " for i in range(1, m+1):\n",
+ " for j in range(1, n+1):\n",
+ " cost = 0 if x[i-1] == y[j-1] else 1\n",
+ "\n",
+ " diag = S[i-1, j-1] + cost\n",
+ " up = S[i-1, j] + 1\n",
+ " left = S[i, j-1] + 1\n",
+ "\n",
+ " S[i, j] = min(diag, up, left)\n",
+ "\n",
+ " # Stockage de la direction optimale\n",
+ " if S[i, j] == diag:\n",
+ " B[i, j] = DIAG\n",
+ " elif S[i, j] == up:\n",
+ " B[i, j] = UP\n",
+ " else:\n",
+ " B[i, j] = LEFT\n",
+ " \n",
+ " # Backtrack\n",
+ " i, j = m, n\n",
+ " sx, sy = \"\", \"\"\n",
+ "\n",
+ " while (i, j) != (0, 0):\n",
+ " if B[i, j] == DIAG:\n",
+ " sx = x[i-1] + sx\n",
+ " sy = y[j-1] + sy\n",
+ " i, j = i-1, j-1\n",
+ " elif B[i, j] == UP:\n",
+ " sx = x[i-1] + sx\n",
+ " sy = \"-\" + sy\n",
+ " i -= 1\n",
+ " else: # LEFT\n",
+ " sx = \"-\" + sx\n",
+ " sy = y[j-1] + sy\n",
+ " j -= 1\n",
+ "\n",
+ " return S[m, n], sx, sy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2\n",
+ "3\n",
+ "391.0 ATGG--T-GAGC-AA---GGGCGAG------G-AG-GAT-AACAT-GGCCA-TC-AT-C--A-AG---GAGTTCA--TG-CG--C--TTCA----AG--GT-GC-A--CATG--G----AG--GGCTCCGTGAACGGCCACGAGTTCGAGATCGAG-GGCGAGGGCG-AGG-GCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTG-ACCA-AGGGTGGCCCCCTG-CCCTTCGCCTGGGACATCCTGTCCCCTCAGTTCATGTACGG-CTCCAAGGCCTACGTGAAG-CACCCCG--CCGA--CATC--CCCGACTACTTGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGCGTGGTGAC-CGTGAC--CCAGGACTCCTCCCTGCAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCACCAACTTCCCCTCCGACGGCCCCGTAAT--G-CAGAAG-AAGACCATGGGCTGGGAGGCCTCCTCCG-A-GC-GGATGTACCCCGAG--GACG-GCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGCT-GAAGGACGGCGGC-CACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAA-----GA-AGCCC-GTGCAGCTGCCCGGC-GCC--TACAACGTCAACATCAAGTTGGAC-ATC---ACCTCCC-AC--AACGAG-GACTACACCATCGTGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAA ATGGTCTCCTTCAAATCTCTCCTAGTTCTCTGTTGCGCTGCCCTTGGGGCATTCGCTACGAAGAGAATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTC-AGTGGAGAGGGTGAAGGTGATGCAAC-A-TACGGAAAACTTACC-CTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCC---AACA-CTTGT-CACTACTTTCACCTATGGTGTTCAATG-CT-TTTCAAGATACCCAGATCATATGAAGCGGCACGACTTCTTCAAGAGCGCCATGCCTGAGGGATACGTGCAGGAGAGGACCATCTTCTTCAAAGACGACGG-GAACT-ACAAGACACGTGCTGAAGTCAAGTTTG-AGGGAGACACCCTCGTCAACAGGATCGAGCTTAAGGGAATCGATTTCAAGGAGGACGGAAACATCCTCGGCCACAAGTTGGAATA-CAACTACAACTCCCACAACGTATACATCATGGCCGACAAGCAAAAGAACGGCATCAAAGCCAACTTCAAGACCCGCCACAACATCGAA-GACGGCGGCGTGCAAC-TCGCTGA--TCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTAC-CTGTCCACA-CAATCTGCCCTTTCGAAAGATCCCAACGAAAAGAGAGACCACATGGTCCTTCTTGAGTTTGTAAC-AGCTGCTGG--G--ATTACA-C-ATGGCATGGATGAACTATACAAATAA\n"
+ ]
}
+ ],
+ "source": [
+ "val = levenshtein_rec('CCAG', 'CA', len('CCAG'), len('CA'))\n",
+ "print(val)\n",
+ "\n",
+ "val = levenshtein_rec('CCGT', 'CGTCA', len('CCGT'), len('CGTCA'))\n",
+ "print(val)\n",
+ "\n",
+ "val, sx, sy = levenshtein_iter(\"ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGCACATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGTGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCTCAGTTCATGTACGGCTCCAAGGCCTACGTGAAGCACCCCGCCGACATCCCCGACTACTTGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGCGTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCACCAACTTCCCCTCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCTCCGAGCGGATGTACCCCGAGGACGGCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGCTGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACAAGGCCAAGAAGCCCGTGCAGCTGCCCGGCGCCTACAACGTCAACATCAAGTTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCGCCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAA\", \"ATGGTCTCCTTCAAATCTCTCCTAGTTCTCTGTTGCGCTGCCCTTGGGGCATTCGCTACGAAGAGAATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCCAACACTTGTCACTACTTTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAGCGGCACGACTTCTTCAAGAGCGCCATGCCTGAGGGATACGTGCAGGAGAGGACCATCTTCTTCAAAGACGACGGGAACTACAAGACACGTGCTGAAGTCAAGTTTGAGGGAGACACCCTCGTCAACAGGATCGAGCTTAAGGGAATCGATTTCAAGGAGGACGGAAACATCCTCGGCCACAAGTTGGAATACAACTACAACTCCCACAACGTATACATCATGGCCGACAAGCAAAAGAACGGCATCAAAGCCAACTTCAAGACCCGCCACAACATCGAAGACGGCGGCGTGCAACTCGCTGATCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCGAAAGATCCCAACGAAAAGAGAGACCACATGGTCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAA\")\n",
+ "print(val, \" \", sx, \" \", sy)"
+ ]
},
- "cells": [
- {
- "cell_type": "markdown",
- "source": [
- "# BINF TP3 - Algorithmes d'alignement par paire"
- ],
- "metadata": {
- "id": "V09wQ1WIOmgn"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Dans ce TP nous allons manipuler les algorithmes d'alignement par paire."
- ],
- "metadata": {
- "id": "er6CtAyOxC6F"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Exercice 0 - Echauffement"
- ],
- "metadata": {
- "id": "BqEa3BJ1xICM"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q1. Donnez le score de la superposition :\n",
- "\n",
- "| | |\n",
- "| :---: | :---: |\n",
- "x | ATGTCATGA---TAC |\n",
- "y | AT--CTAAATGTTAC |\n",
- "\n",
- "\n",
- "étant donne le schéma d'évaluation :\n",
- "\n",
- "| | A | T | G | C |\n",
- "| :---: | :---: | :---: | :---: | :---: |\n",
- "| **A** | 1 | -1 | -1 | -1 |\n",
- "| **T** | -1 | 1 | -1 | -1 |\n",
- "| **G** | -1 | -1 | 1 | -1 |\n",
- "| **C** | -1 | -1 | -1 | 1 |\n",
- "\n",
- "et\n",
- "\n",
- "$\\gamma(g) = 0.5 |g| + 0.5$"
- ],
- "metadata": {
- "id": "qqiiq5bcxYvM"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "kCJGGGYQ2GNi"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q2. Alignez les séquences suivantes avec l'algorithme de Levenshtein : x = ATG et y = ACTG."
- ],
- "metadata": {
- "id": "XyhXAhK-2NKJ"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "b9iovhyZ2bXw"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q3.\tAlignez les séquences suivantes avec l'algorithme de Needleman-Wunsch global x = TAT et y = ATGAC en considérant le schéma d'évaluation suivant\n",
- "\n",
- "| | A | T | G | C |\n",
- "| :---: | :---: | :---: | :---: | :---: |\n",
- "| **A** | 1 | -0.5 | -0.5 | -0.5 |\n",
- "| **T** | -0.5 | 1 | -0.5 | -0.5 |\n",
- "| **G** | -0.5 | -0.5 | 1 | -0.5 |\n",
- "| **C** | -0.5 | -0.5 | -0.5 | 1 |\n",
- "\n",
- "et\n",
- "\n",
- "$\\gamma(g) = 0.5 |g|$\n"
- ],
- "metadata": {
- "id": "OV_YaQHr2elB"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "g_MrecVs3Nrw"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q4. Alignez les séquences suivantes avec l'algorithme de Smith-Waterman x = TTGG y = ATGAC en utilisant le schéma d'évaluation de la question précédente.\n"
- ],
- "metadata": {
- "id": "y1YF-G6E3Qoo"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "LLMECocb3pgI"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Exercice 1 : Algorithme de Levenshtein - version récursive"
- ],
- "metadata": {
- "id": "46gw0avh3wGw"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q1. Ecrivez une fonction\n",
- "\n",
- "levenshtein(x: str, y: str) -> int\n",
- "\n",
- "qui retourne la distance de Levenshtein entre les séquences x et y en utilisant la version récursive de l'algorithme."
- ],
- "metadata": {
- "id": "ZKc09Kyg4a6v"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "#Votre code ici"
- ],
- "metadata": {
- "id": "FJR69IEQ4aHv"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q2. Vous pouvez tester votre code sur les exemples suivants:\n",
- "\n",
- "\n",
- "* $L('CCAG', 'CA') = 2$\n",
- "* $L('CCGT', 'CGTCA') = 3$\n",
- "* $L(AY678264^*, OQ870305^*) = 310$\n",
- "\n",
- "$^*$ ids genbank de deux sequences."
- ],
- "metadata": {
- "id": "arFVwA6E5NWn"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Exercice 2 : Algorithme de Smith-Waterman - version itérative"
- ],
- "metadata": {
- "id": "erCpfG7O7BV-"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q1. Ecrivez la fonction\n",
- "\n",
- "sw_fwd(x: str, y: str, cmap: dict, sigma: array, (go, ge): list) -> (array, array)\n",
- "\n",
- "qui construit les matrices $S$ et $B$ en utilisant l'algorithme de Smith-Waterman pour aligner les séquences x et y suivant le schéma d'évaluation donné par la matrice de substitution $\\Sigma$ et la fonction d'évaluation des trous $\\gamma(n)= g_o + g_e \\times n$. Le dictionnaire cmap donne la position des différents nucléotides dans la matrice $\\Sigma$. La fonction retourne la paire de matrices de score $S$ et de retour $B$."
- ],
- "metadata": {
- "id": "rv2Y78y37IOd"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "#Votre code ici"
- ],
- "metadata": {
- "id": "njn3JB0b-WHj"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q2. Ecrivez la fonction\n",
- "\n",
- "sw_bwd(x: str, y: str, S: array, B: array) -> (str, str, float)\n",
- "\n",
- "qui effectue l'etape de retour de l'algorithme de Smith-Waterman etant donné les séquences $x$ et $y$ et les matrices de score $S$ et de retour $B$. La fonction retourne un tuple contenant les alignements des séquences x et y et le score de l'alignement."
- ],
- "metadata": {
- "id": "55n8mt9U-Wai"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "#Votre code ici"
- ],
- "metadata": {
- "id": "ij9JDpBm_UZ7"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q3. Vous pouvez tester votre code en utilisant le schéma d'évaluation suivant :"
- ],
- "metadata": {
- "id": "kwmxg2dxAiwS"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "cmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n",
- "m = np.array([[1, -0.5, -0.5, -0.5],\n",
- " [-0.5, 1, -0.5, -0.5],\n",
- " [-0.5, -0.5, 1, -0.5],\n",
- " [-0.5, -0.5, -0.5, 1]])\n",
- "go = 0\n",
- "ge = 0.5"
- ],
- "metadata": {
- "id": "JUtYRFTBAwwZ"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "* $SW('TCGC', 'CTTAG')$ retourne un score de $1.5$ à la position $(3,5)$ et l'alignement"
- ],
- "metadata": {
- "id": "eMGh4K5aIFxE"
- }
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "arFVwA6E5NWn"
+ },
+ "source": [
+ "Q2. Vous pouvez tester votre code sur les exemples suivants:\n",
+ "\n",
+ "\n",
+ "* $L('CCAG', 'CA') = 2$\n",
+ "* $L('CCGT', 'CGTCA') = 3$\n",
+ "* $L(AY678264^*, OQ870305^*) = 310$\n",
+ "\n",
+ "$^*$ ids genbank de deux sequences."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "erCpfG7O7BV-"
+ },
+ "source": [
+ "# Exercice 2 : Algorithme de Smith-Waterman - version itérative"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rv2Y78y37IOd"
+ },
+ "source": [
+ "Q1. Ecrivez la fonction\n",
+ "\n",
+ "sw_fwd(x: str, y: str, cmap: dict, sigma: array, (go, ge): list) -> (array, array)\n",
+ "\n",
+ "qui construit les matrices $S$ et $B$ en utilisant l'algorithme de Smith-Waterman pour aligner les séquences x et y suivant le schéma d'évaluation donné par la matrice de substitution $\\Sigma$ et la fonction d'évaluation des trous $\\gamma(n)= g_o + g_e \\times n$. Le dictionnaire cmap donne la position des différents nucléotides dans la matrice $\\Sigma$. La fonction retourne la paire de matrices de score $S$ et de retour $B$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "id": "njn3JB0b-WHj"
+ },
+ "outputs": [],
+ "source": [
+ "UP = 1\n",
+ "LEFT = 2\n",
+ "DIAG = 3\n",
+ "\n",
+ "def sw_fwd(x, y, cmap, sigma, gap_params):\n",
+ " m, n = len(x), len(y)\n",
+ " \n",
+ " # Matrices\n",
+ " S = np.zeros((m+1, n+1))\n",
+ " B = np.zeros((m+1, n+1))\n",
+ "\n",
+ " go, ge = gap_params\n",
+ " \n",
+ " # Initialiser la première ligne et la première colonne\n",
+ " for i in range(1, m + 1):\n",
+ " S[i, 0] = -go - ge * (i)\n",
+ " B[i, 0] = UP\n",
+ "\n",
+ " for j in range(1, n + 1):\n",
+ " S[0, j] = -go - ge * (j)\n",
+ " B[0, j] = LEFT\n",
+ " \n",
+ " # Remplir les matrices S et B\n",
+ " for i in range(1, m + 1):\n",
+ " for j in range(1, n + 1):\n",
+ " match = S[i - 1, j - 1] + sigma[cmap[x[i - 1]], cmap[y[j - 1]]]\n",
+ " delete = max([S[i - k, j] - (go + ge * (k - 1)) for k in range(1, i + 1)])\n",
+ " insert = max([S[i, j - k] - (go + ge * (k - 1)) for k in range(1, j + 1)])\n",
+ " \n",
+ " S[i, j] = max(0, match, delete, insert)\n",
+ " \n",
+ " if S[i, j] == 0:\n",
+ " B[i, j] = 0\n",
+ " elif S[i, j] == match:\n",
+ " B[i, j] = DIAG # Diagonale (match ou mismatch)\n",
+ " elif S[i, j] == delete:\n",
+ " B[i, j] = UP # Suppression (gap dans y)\n",
+ " else:\n",
+ " B[i, j] = LEFT # Insertion (gap dans x)\n",
+ "\n",
+ " return S, B"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "55n8mt9U-Wai"
+ },
+ "source": [
+ "Q2. Ecrivez la fonction\n",
+ "\n",
+ "sw_bwd(x: str, y: str, S: array, B: array) -> (str, str, float)\n",
+ "\n",
+ "qui effectue l'etape de retour de l'algorithme de Smith-Waterman etant donné les séquences $x$ et $y$ et les matrices de score $S$ et de retour $B$. La fonction retourne un tuple contenant les alignements des séquences x et y et le score de l'alignement."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "id": "ij9JDpBm_UZ7"
+ },
+ "outputs": [],
+ "source": [
+ "def sw_bwd(x, y, S, B):\n",
+ " # Trouver la position du score max\n",
+ " i, j = np.unravel_index(np.argmax(S), S.shape)\n",
+ " sx, sy = \"\", \"\"\n",
+ "\n",
+ " while S[i, j] > 0:\n",
+ " if B[i, j] == DIAG:\n",
+ " sx = x[i-1] + sx\n",
+ " sy = y[j-1] + sy\n",
+ " i, j = i-1, j-1\n",
+ " elif B[i, j] == UP:\n",
+ " sx = x[i-1] + sx\n",
+ " sy = \"-\" + sy\n",
+ " i -= 1\n",
+ " elif B[i, j] == LEFT:\n",
+ " sx = \"-\" + sx\n",
+ " sy = y[j-1] + sy\n",
+ " j -= 1\n",
+ " \n",
+ " # Score de l'alignement\n",
+ " val = S[np.unravel_index(np.argmax(S), S.shape)]\n",
+ " return sx, sy, val"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kwmxg2dxAiwS"
+ },
+ "source": [
+ "Q3. Vous pouvez tester votre code en utilisant le schéma d'évaluation suivant :"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "id": "JUtYRFTBAwwZ"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Score Matrix (S):\n",
+ " [[ 0. -0.5 -1. -1.5 -2. -2.5]\n",
+ " [-0.5 0. 0.5 0.5 0.5 0.5]\n",
+ " [-1. 0. 1. 1. 1. 1. ]\n",
+ " [-1.5 0. 1. 2. 2. 2. ]\n",
+ " [-2. 0. 1. 2. 2. 2. ]]\n",
+ "Backtrack Matrix (B):\n",
+ " [[0. 2. 2. 2. 2. 2.]\n",
+ " [1. 0. 3. 2. 2. 2.]\n",
+ " [1. 0. 3. 2. 2. 2.]\n",
+ " [1. 0. 1. 3. 2. 2.]\n",
+ " [1. 0. 1. 3. 1. 1.]]\n",
+ "Aligned Sequences:\n",
+ " TG \n",
+ " TG\n",
+ "Alignment Score:\n",
+ " 2.0\n",
+ "Aligned Sequences (Long):\n",
+ " ATGG------------------TGAG----C-AAG--G-GC-G-----AGGAGG-A-----TA--ACATG-G-CCAT-CA-T-CAA-G-G-AG-----TTCA-T-G-CG-CT-T--CAA-----G-GTGCACA-T-GGA-GG-G-CTCCG-TGAA--CGGC-C-ACGA--GTTC-GAGATCGAG-GG-CGAGGG-CG-AGG-G--CCGC--C-CCTAC-G---AGGGC--ACCC--------A---G-AC--C--G--CCAAGCTGA---AG-GTGACCAAGGGTGGCC--C-C-CTG-C-C--C-TTC-GCC--TGG-G---ACAT-C----C--TG-T-CCC---CTCAGTTCATG-TA-CGGC-TC-CAAGGC--C-TACGTGAAGCA-C-CC--CGCC-GA---CAT-C---C--CCGA-CTA-CTTGA--A-GCTGTCCTTC-----CCCGA-GGG--CT-TCAAGTGGGAGC-GCGTG-ATGAA-CTTC-GAG---GACGGCG-G-CGTGGTGA--C-CGT--GACCCAGGACTC---CTCCCTGCAGGACGGCG-AGTTC-A-TCTACAAGGTGAAGCTG-CGCGGCACCAAC-TTCCCCTCCGACGGCC-C--CGTAATGCAGAA-GA-AGAC--CATGGGCTGGGAGGCCTCCTC-C-GAGCG-GATGTAC--C----CCG---AG-----G-ACGGC--GC-CCTGAAG-GGCGAGA--TCAAG---CAG--AGGCTGAAGC-T-GAAGGACGGCGGC---C-ACTACGAC-G--C--TGAGGT---C-AAGA-CCAC-CTACAA--GGC-CA-AGAAGCCC-GTGCAGC---T-GCC-CG-GC--GCC--TACAAC-GT-CA-AC-ATC-AAG----TT-G-GACA-TCACCTCCC-AC---AACGAG-GA-CTACA----CC-ATC-GTGGAACAG--TACG-AAC-GC-GC-CGAGGGCCGCCA-CT-C-CA-CCGGC--GGCATGGACGAGCT-GTAC-AAGTAA \n",
+ " ATGGTCTCCTTCAAATCTCTCCT-AGTTCTCT--GTTGCGCTGCCCTT-GG-GGCATTCGCTACGA-A-GAGA--ATG-AGTA-AAGGAGAAGAACTTTTCACTGGA-GT-TGTCCCAATTCTTGT-TG-A-ATTA-GATGGTGA-T--GTT-AATG-GGCACAA--ATT-TTCT--G-TC-AGTGGA-GAGGGT-GAAGGTGAT--GCAACA--TACGGAAAA---CTTACCCTTAAATTTATTTGCACTACTGGAA--AA-CT-ACCT-GT-T--CC-A---TGGCCAACACT-TGTCACTACTTTCA-CCTATGGTGTTCA-ATGCTTTTCAA-GATACCCAGA-TCA--T-ATGA-AGCGGCA-CG--A--CTTCTT-C---AAG-AGCGCCAT-GCCTGAGGG-ATACGTGCAG--GAG--AG---GACCAT-CT-T-CTTCAAAGA--CGACGGGAACTA-CAA----GA-CA-CGTGC-TGAAG--TCA-AGTTTGA-GG-GAGAC------ACCCTCGTCA-A--CAGGA-TCGAGCT---T--A--A-GG-GAA--TCGATT-T-CAA-G-G-AG--GA--CGG-A--AACA-T--CCT----CGGCCACAA-GT--TG--GAAT-ACA-ACTACA---------A---CTCC-CACA-A-CGT-A--TACATCATGGCCGACAAGCAAAAGAACGGCAT-CA----AAGC--C-A-ACTTCAAGACCC-GCCA--C--AA-CATCGAA-GACGGCGGCGTGCAACT-CG-CTGATCATT-A--TCAACAAA-AT--ACTC--CAATTGGCG-AT-G--GCCCTGT-C--CTTTTA-CCA-GA-CAA-CCATTAC--CTGTCCACACAATCT--GCCCTTTCGA-A-AG--A--TCCCAACGAAAA-GAGAGACC-ACATGGTCCT-TCT-T-G---AGTTT--GTAACAGCTGCT----GG--G--AT-TACACAT--GGCATGG-AT-GA--A-CTA-TACAAA-TAA\n",
+ "Alignment Score (Long):\n",
+ " 475.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "cmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n",
+ "m = np.array([[1, -0.5, -0.5, -0.5],\n",
+ " [-0.5, 1, -0.5, -0.5],\n",
+ " [-0.5, -0.5, 1, -0.5],\n",
+ " [-0.5, -0.5, -0.5, 1]])\n",
+ "go = 0\n",
+ "ge = 0.5\n",
+ "\n",
+ "# Exemple de séquences\n",
+ "x = \"TTGG\"\n",
+ "y = \"ATGAC\"\n",
+ "\n",
+ "#x = \"TCGC\"\n",
+ "#y = \"CTTAG\"\n",
+ "\n",
+ "# Calcul des matrices S et B\n",
+ "S, B = sw_fwd(x, y, cmap, m, [go, ge])\n",
+ "\n",
+ "# Traçage arrière\n",
+ "sx, sy, val = sw_bwd(x, y, S, B)\n",
+ "\n",
+ "# Affichage des résultats\n",
+ "print(\"Score Matrix (S):\\n\", S)\n",
+ "print(\"Backtrack Matrix (B):\\n\", B)\n",
+ "print(\"Aligned Sequences:\\n\", sx, \"\\n\", sy)\n",
+ "print(\"Alignment Score:\\n\", val)\n",
+ "\n",
+ "\n",
+ "x = \"ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGCACATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGTGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCTCAGTTCATGTACGGCTCCAAGGCCTACGTGAAGCACCCCGCCGACATCCCCGACTACTTGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGCGTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCACCAACTTCCCCTCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCTCCGAGCGGATGTACCCCGAGGACGGCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGCTGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACAAGGCCAAGAAGCCCGTGCAGCTGCCCGGCGCCTACAACGTCAACATCAAGTTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCGCCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAA\"\n",
+ "y = \"ATGGTCTCCTTCAAATCTCTCCTAGTTCTCTGTTGCGCTGCCCTTGGGGCATTCGCTACGAAGAGAATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCCAACACTTGTCACTACTTTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAGCGGCACGACTTCTTCAAGAGCGCCATGCCTGAGGGATACGTGCAGGAGAGGACCATCTTCTTCAAAGACGACGGGAACTACAAGACACGTGCTGAAGTCAAGTTTGAGGGAGACACCCTCGTCAACAGGATCGAGCTTAAGGGAATCGATTTCAAGGAGGACGGAAACATCCTCGGCCACAAGTTGGAATACAACTACAACTCCCACAACGTATACATCATGGCCGACAAGCAAAAGAACGGCATCAAAGCCAACTTCAAGACCCGCCACAACATCGAAGACGGCGGCGTGCAACTCGCTGATCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCGAAAGATCCCAACGAAAAGAGAGACCACATGGTCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAA\"\n",
+ "\n",
+ "# Calcul des matrices S et B pour les longues séquences\n",
+ "S, B = sw_fwd(x, y, cmap, m, [go, ge])\n",
+ "\n",
+ "# Traçage arrière pour les longues séquences\n",
+ "sx, sy, val = sw_bwd(x, y, S, B)\n",
+ "\n",
+ "# Affichage des résultats pour les longues séquences\n",
+ "print(\"Aligned Sequences (Long):\\n\", sx, \"\\n\", sy)\n",
+ "print(\"Alignment Score (Long):\\n\", val)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eMGh4K5aIFxE"
+ },
+ "source": [
+ "* $SW('TCGC', 'CTTAG')$ retourne un score de $1.5$ à la position $(3,5)$ et l'alignement"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 60
},
+ "id": "joHNwJ9AIf6F",
+ "outputId": "a9206810-a083-4d86-8b14-38183f1dd80c"
+ },
+ "outputs": [
{
- "cell_type": "code",
- "source": [
- "HTML(\"
\")"
+ "data": {
+ "text/html": [
+ ""
],
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 60
- },
- "id": "joHNwJ9AIf6F",
- "outputId": "a9206810-a083-4d86-8b14-38183f1dd80c"
- },
- "execution_count": null,
- "outputs": [
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- ""
- ]
- },
- "metadata": {},
- "execution_count": 18
- }
+ "text/plain": [
+ ""
]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from IPython.display import HTML\n",
+ "HTML(\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JJlU5yvZI43D"
+ },
+ "source": [
+ "* $SW(AY678264^*, OQ870305^*)$ retourne un score de $342.1$ à la position $(708,717)$ et l'alignement"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 80
},
+ "id": "HUELvWKMFtIO",
+ "outputId": "976bab6f-f1fc-4c5a-c69c-8de02fc838d0"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "source": [
- "* $SW(AY678264^*, OQ870305^*)$ retourne un score de $342.1$ à la position $(708,717)$ et l'alignement"
- ],
- "metadata": {
- "id": "JJlU5yvZI43D"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "from IPython.display import HTML\n",
- "HTML(\"| x: | ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG |
|---|
| y: | ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG |
|---|
\")"
+ "data": {
+ "text/html": [
+ "| x: | ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG |
|---|
| y: | ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG |
|---|
"
],
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 80
- },
- "id": "HUELvWKMFtIO",
- "outputId": "976bab6f-f1fc-4c5a-c69c-8de02fc838d0"
- },
- "execution_count": null,
- "outputs": [
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- "| x: | ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG |
|---|
| y: | ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG |
|---|
"
- ]
- },
- "metadata": {},
- "execution_count": 15
- }
+ "text/plain": [
+ ""
]
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Exercice 3 : Distribution des scores d’alignement pour des séquences aléatoires\n",
- "\n",
- "Pour tester si un alignement reflète une réelle similarité biologique, on va évaluer la distribution des scores d’alignement pour des paires de séquences aléatoires."
- ],
- "metadata": {
- "id": "Q5jVeLfgMMtA"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q1. En considérant deux séquences aléatoires de même taille N, où chaque nucléotide apparaît avec une probabilité uniforme de ¼, calculer le score moyen attendu pour une superposition sans trou dans le cas où une identité vaut +1 et une différence vaut 0."
- ],
- "metadata": {
- "id": "6xyXw0HsMQGf"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "meF18gt-Mhcn"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q2. La question précédente peut se resoudre analytiquement car on ne considère pas de trou. Pour étendre le résultat precedent à un alignement avec trous, on va se baser sur la simulation de séquences aleatoires.\n",
- "\n",
- "Générez $R$ paires de séquences aléatoires de tailles $N$ avec des probabilitées uniformes d'apparition de nucléotides $p_A = p_T = p_G = p_C = $ ¼. Affichez sous forme de violinplots les distribution des scores d'alignements entre chaque paire, obtenu par :\n",
- " 1. un alignement sans trou (cf. Q1) ;\n",
- " 2. un alignement local via Smith-Waterman (utilisez le code de l'exercice précédent)\n",
- "\n",
- "Utilisez le schéma d'évaluation suivant :"
- ],
- "metadata": {
- "id": "fP5_mHnYMkNI"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "rmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n",
- "sigma = np.array([[1, -0.5, -0.5, -0.5],\n",
- " [-0.5, 1, -0.5, -0.5],\n",
- " [-0.5, -0.5, 1, -0.5],\n",
- " [-0.5, -0.5, -0.5, 1]])\n",
- "go =0\n",
- "ge = 0.5"
- ],
- "metadata": {
- "id": "akUVqotnOLkH"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "source": [
- "#Votre code ici"
- ],
- "metadata": {
- "id": "UX0afNaqOVZ2"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q3. Qu'observez-vous ?"
- ],
- "metadata": {
- "id": "UNn9fUuXO4Le"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "dSQEl0XXO8IG"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "Q4. Quelle conclusion peut-on en tirer sur la significativité d'un alignement ?"
- ],
- "metadata": {
- "id": "xHfVXpQhf15n"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "```markdown\n",
- "Votre réponse ici\n",
- "```"
- ],
- "metadata": {
- "id": "5KjhEeHDgDns"
- }
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
}
- ]
-}
\ No newline at end of file
+ ],
+ "source": [
+ "from IPython.display import HTML\n",
+ "HTML(\"| x: | ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG |
|---|
| y: | ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG |
|---|
\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q5jVeLfgMMtA"
+ },
+ "source": [
+ "# Exercice 3 : Distribution des scores d’alignement pour des séquences aléatoires\n",
+ "\n",
+ "Pour tester si un alignement reflète une réelle similarité biologique, on va évaluer la distribution des scores d’alignement pour des paires de séquences aléatoires."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6xyXw0HsMQGf"
+ },
+ "source": [
+ "Q1. En considérant deux séquences aléatoires de même taille N, où chaque nucléotide apparaît avec une probabilité uniforme de ¼, calculer le score moyen attendu pour une superposition sans trou dans le cas où une identité vaut +1 et une différence vaut 0."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "meF18gt-Mhcn"
+ },
+ "source": [
+ "```markdown\n",
+ "Le score pour chaque position identique est +1 et pour chaque position différente 0. Donc, le score moyen attendu pour la superposition sans trou est : N * 1/4 * 1 + N * 3/4 * 0 = N/4\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fP5_mHnYMkNI"
+ },
+ "source": [
+ "Q2. La question précédente peut se resoudre analytiquement car on ne considère pas de trou. Pour étendre le résultat precedent à un alignement avec trous, on va se baser sur la simulation de séquences aleatoires.\n",
+ "\n",
+ "Générez $R$ paires de séquences aléatoires de tailles $N$ avec des probabilitées uniformes d'apparition de nucléotides $p_A = p_T = p_G = p_C = $ ¼. Affichez sous forme de violinplots les distribution des scores d'alignements entre chaque paire, obtenu par :\n",
+ " 1. un alignement sans trou (cf. Q1) ;\n",
+ " 2. un alignement local via Smith-Waterman (utilisez le code de l'exercice précédent)\n",
+ "\n",
+ "Utilisez le schéma d'évaluation suivant :"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "akUVqotnOLkH"
+ },
+ "outputs": [],
+ "source": [
+ "rmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n",
+ "sigma = np.array([[1, -0.5, -0.5, -0.5],\n",
+ " [-0.5, 1, -0.5, -0.5],\n",
+ " [-0.5, -0.5, 1, -0.5],\n",
+ " [-0.5, -0.5, -0.5, 1]])\n",
+ "go =0\n",
+ "ge = 0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "UX0afNaqOVZ2"
+ },
+ "outputs": [],
+ "source": [
+ "#Votre code ici"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UNn9fUuXO4Le"
+ },
+ "source": [
+ "Q3. Qu'observez-vous ?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dSQEl0XXO8IG"
+ },
+ "source": [
+ "```markdown\n",
+ "Votre réponse ici\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xHfVXpQhf15n"
+ },
+ "source": [
+ "Q4. Quelle conclusion peut-on en tirer sur la significativité d'un alignement ?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5KjhEeHDgDns"
+ },
+ "source": [
+ "```markdown\n",
+ "Votre réponse ici\n",
+ "```"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "authorship_tag": "ABX9TyNSXnqaXAUgZK9rmJ1TWbGo",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}