diff --git a/.ipynb_checkpoints/BINF2025_TP3-checkpoint.ipynb b/.ipynb_checkpoints/BINF2025_TP3-checkpoint.ipynb new file mode 100644 index 0000000..61e87c2 --- /dev/null +++ b/.ipynb_checkpoints/BINF2025_TP3-checkpoint.ipynb @@ -0,0 +1,481 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "authorship_tag": "ABX9TyNSXnqaXAUgZK9rmJ1TWbGo" + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# BINF TP3 - Algorithmes d'alignement par paire" + ], + "metadata": { + "id": "V09wQ1WIOmgn" + } + }, + { + "cell_type": "markdown", + "source": [ + "Dans ce TP nous allons manipuler les algorithmes d'alignement par paire." + ], + "metadata": { + "id": "er6CtAyOxC6F" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Exercice 0 - Echauffement" + ], + "metadata": { + "id": "BqEa3BJ1xICM" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q1. Donnez le score de la superposition :\n", + "\n", + "| | |\n", + "| :---: | :---: |\n", + "x | ATGTCATGA---TAC |\n", + "y | AT--CTAAATGTTAC |\n", + "\n", + "\n", + "étant donne le schéma d'évaluation :\n", + "\n", + "| | A | T | G | C |\n", + "| :---: | :---: | :---: | :---: | :---: |\n", + "| **A** | 1 | -1 | -1 | -1 |\n", + "| **T** | -1 | 1 | -1 | -1 |\n", + "| **G** | -1 | -1 | 1 | -1 |\n", + "| **C** | -1 | -1 | -1 | 1 |\n", + "\n", + "et\n", + "\n", + "$\\gamma(g) = 0.5 |g| + 0.5$" + ], + "metadata": { + "id": "qqiiq5bcxYvM" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "kCJGGGYQ2GNi" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q2. Alignez les séquences suivantes avec l'algorithme de Levenshtein : x = ATG et y = ACTG." 
+ ], + "metadata": { + "id": "XyhXAhK-2NKJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "b9iovhyZ2bXw" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q3.\tAlignez les séquences suivantes avec l'algorithme de Needleman-Wunsch global x = TAT et y = ATGAC en considérant le schéma d'évaluation suivant\n", + "\n", + "| | A | T | G | C |\n", + "| :---: | :---: | :---: | :---: | :---: |\n", + "| **A** | 1 | -0.5 | -0.5 | -0.5 |\n", + "| **T** | -0.5 | 1 | -0.5 | -0.5 |\n", + "| **G** | -0.5 | -0.5 | 1 | -0.5 |\n", + "| **C** | -0.5 | -0.5 | -0.5 | 1 |\n", + "\n", + "et\n", + "\n", + "$\\gamma(g) = 0.5 |g|$\n" + ], + "metadata": { + "id": "OV_YaQHr2elB" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "g_MrecVs3Nrw" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q4. Alignez les séquences suivantes avec l'algorithme de Smith-Waterman x = TTGG y = ATGAC en utilisant le schéma d'évaluation de la question précédente.\n" + ], + "metadata": { + "id": "y1YF-G6E3Qoo" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "LLMECocb3pgI" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Exercice 1 : Algorithme de Levenshtein - version récursive" + ], + "metadata": { + "id": "46gw0avh3wGw" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q1. Ecrivez une fonction\n", + "\n", + "levenshtein(x: str, y: str) -> int\n", + "\n", + "qui retourne la distance de Levenshtein entre les séquences x et y en utilisant la version récursive de l'algorithme." 
+ ], + "metadata": { + "id": "ZKc09Kyg4a6v" + } + }, + { + "cell_type": "code", + "source": [ + "#Votre code ici" + ], + "metadata": { + "id": "FJR69IEQ4aHv" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Q2. Vous pouvez tester votre code sur les exemples suivants:\n", + "\n", + "\n", + "* $L('CCAG', 'CA') = 2$\n", + "* $L('CCGT', 'CGTCA') = 3$\n", + "* $L(AY678264^*, OQ870305^*) = 310$\n", + "\n", + "$^*$ ids genbank de deux sequences." + ], + "metadata": { + "id": "arFVwA6E5NWn" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Exercice 2 : Algorithme de Smith-Waterman - version itérative" + ], + "metadata": { + "id": "erCpfG7O7BV-" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q1. Ecrivez la fonction\n", + "\n", + "sw_fwd(x: str, y: str, cmap: dict, sigma: array, (go, ge): list) -> (array, array)\n", + "\n", + "qui construit les matrices $S$ et $B$ en utilisant l'algorithme de Smith-Waterman pour aligner les séquences x et y suivant le schéma d'évaluation donné par la matrice de substitution $\\Sigma$ et la fonction d'évaluation des trous $\\gamma(n)= g_o + g_e \\times n$. Le dictionnaire cmap donne la position des différents nucléotides dans la matrice $\\Sigma$. La fonction retourne la paire de matrices de score $S$ et de retour $B$." + ], + "metadata": { + "id": "rv2Y78y37IOd" + } + }, + { + "cell_type": "code", + "source": [ + "#Votre code ici" + ], + "metadata": { + "id": "njn3JB0b-WHj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Q2. Ecrivez la fonction\n", + "\n", + "sw_bwd(x: str, y: str, S: array, B: array) -> (str, str, float)\n", + "\n", + "qui effectue l'etape de retour de l'algorithme de Smith-Waterman etant donné les séquences $x$ et $y$ et les matrices de score $S$ et de retour $B$. La fonction retourne un tuple contenant les alignements des séquences x et y et le score de l'alignement." 
+ ], + "metadata": { + "id": "55n8mt9U-Wai" + } + }, + { + "cell_type": "code", + "source": [ + "#Votre code ici" + ], + "metadata": { + "id": "ij9JDpBm_UZ7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Q3. Vous pouvez tester votre code en utilisant le schéma d'évaluation suivant :" + ], + "metadata": { + "id": "kwmxg2dxAiwS" + } + }, + { + "cell_type": "code", + "source": [ + "cmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", + "m = np.array([[1, -0.5, -0.5, -0.5],\n", + " [-0.5, 1, -0.5, -0.5],\n", + " [-0.5, -0.5, 1, -0.5],\n", + " [-0.5, -0.5, -0.5, 1]])\n", + "go = 0\n", + "ge = 0.5" + ], + "metadata": { + "id": "JUtYRFTBAwwZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "* $SW('TCGC', 'CTTAG')$ retourne un score de $1.5$ à la position $(3,5)$ et l'alignement" + ], + "metadata": { + "id": "eMGh4K5aIFxE" + } + }, + { + "cell_type": "code", + "source": [ + "HTML(\"
x:TCG
y:TAG
\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 60 + }, + "id": "joHNwJ9AIf6F", + "outputId": "a9206810-a083-4d86-8b14-38183f1dd80c" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
x:TCG
y:TAG
" + ] + }, + "metadata": {}, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "* $SW(AY678264^*, OQ870305^*)$ retourne un score de $342.1$ à la position $(708,717)$ et l'alignement" + ], + "metadata": { + "id": "JJlU5yvZI43D" + } + }, + { + "cell_type": "code", + "source": [ + "from IPython.display import HTML\n", + "HTML(\"
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 80 + }, + "id": "HUELvWKMFtIO", + "outputId": "976bab6f-f1fc-4c5a-c69c-8de02fc838d0" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
" + ] + }, + "metadata": {}, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Exercice 3 : Distribution des scores d’alignement pour des séquences aléatoires\n", + "\n", + "Pour tester si un alignement reflète une réelle similarité biologique, on va évaluer la distribution des scores d’alignement pour des paires de séquences aléatoires." + ], + "metadata": { + "id": "Q5jVeLfgMMtA" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q1. En considérant deux séquences aléatoires de même taille N, où chaque nucléotide apparaît avec une probabilité uniforme de ¼, calculer le score moyen attendu pour une superposition sans trou dans le cas où une identité vaut +1 et une différence vaut 0." + ], + "metadata": { + "id": "6xyXw0HsMQGf" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "meF18gt-Mhcn" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q2. La question précédente peut se resoudre analytiquement car on ne considère pas de trou. Pour étendre le résultat precedent à un alignement avec trous, on va se baser sur la simulation de séquences aleatoires.\n", + "\n", + "Générez $R$ paires de séquences aléatoires de tailles $N$ avec des probabilitées uniformes d'apparition de nucléotides $p_A = p_T = p_G = p_C = $ ¼. Affichez sous forme de violinplots les distribution des scores d'alignements entre chaque paire, obtenu par :\n", + " 1. un alignement sans trou (cf. Q1) ;\n", + " 2. 
un alignement local via Smith-Waterman (utilisez le code de l'exercice précédent)\n", + "\n", + "Utilisez le schéma d'évaluation suivant :" + ], + "metadata": { + "id": "fP5_mHnYMkNI" + } + }, + { + "cell_type": "code", + "source": [ + "rmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", + "sigma = np.array([[1, -0.5, -0.5, -0.5],\n", + " [-0.5, 1, -0.5, -0.5],\n", + " [-0.5, -0.5, 1, -0.5],\n", + " [-0.5, -0.5, -0.5, 1]])\n", + "go =0\n", + "ge = 0.5" + ], + "metadata": { + "id": "akUVqotnOLkH" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "#Votre code ici" + ], + "metadata": { + "id": "UX0afNaqOVZ2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Q3. Qu'observez-vous ?" + ], + "metadata": { + "id": "UNn9fUuXO4Le" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "dSQEl0XXO8IG" + } + }, + { + "cell_type": "markdown", + "source": [ + "Q4. Quelle conclusion peut-on en tirer sur la significativité d'un alignement ?" 
+ ], + "metadata": { + "id": "xHfVXpQhf15n" + } + }, + { + "cell_type": "markdown", + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ], + "metadata": { + "id": "5KjhEeHDgDns" + } + } + ] +} \ No newline at end of file diff --git a/BINF2025_TP3.ipynb b/BINF2025_TP3.ipynb index 61e87c2..36e5475 100644 --- a/BINF2025_TP3.ipynb +++ b/BINF2025_TP3.ipynb @@ -1,481 +1,591 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "provenance": [], - "authorship_tag": "ABX9TyNSXnqaXAUgZK9rmJ1TWbGo" - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "V09wQ1WIOmgn" + }, + "source": [ + "# BINF TP3 - Algorithmes d'alignement par paire" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "er6CtAyOxC6F" + }, + "source": [ + "Dans ce TP nous allons manipuler les algorithmes d'alignement par paire." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BqEa3BJ1xICM" + }, + "source": [ + "# Exercice 0 - Echauffement" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qqiiq5bcxYvM" + }, + "source": [ + "Q1. 
Donnez le score de la superposition :\n", + "\n", + "| | |\n", + "| :---: | :---: |\n", + "x | ATGTCATGA---TAC |\n", + "y | AT--CTAAATGTTAC |\n", + "\n", + "\n", + "étant donné le schéma d'évaluation :\n", + "\n", + "| | A | T | G | C |\n", + "| :---: | :---: | :---: | :---: | :---: |\n", + "| **A** | 1 | -1 | -1 | -1 |\n", + "| **T** | -1 | 1 | -1 | -1 |\n", + "| **G** | -1 | -1 | 1 | -1 |\n", + "| **C** | -1 | -1 | -1 | 1 |\n", + "\n", + "et\n", + "\n", + "$\\gamma(g) = 0.5 |g| + 0.5$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kCJGGGYQ2GNi" + }, + "source": [ + "```markdown\n", + "\n", + "A T G T C A T G A - - - T A C\n", + "A T - - C T A A A T G T T A C\n", + "\n", + "gamma(2) = 0.5 * 2 + 0.5 = 1.5\n", + "gamma(3) = 0.5 * 3 + 0.5 = 2\n", + "\n", + "s(x,y) = sig_A,A + sig_T,T + sig_C,C + sig_A,T + sig_T,A + sig_G,A + sig_A,A + sig_T,T + sig_A,A + sig_C,C - gamma(2) - gamma(3)\n", + "s(x,y) = 1 + 1 + 1 - 1 - 1 - 1 + 1 + 1 + 1 + 1 - 1.5 - 2\n", + "s(x,y) = 0.5\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XyhXAhK-2NKJ" + }, + "source": [ + "Q2. Alignez les séquences suivantes avec l'algorithme de Levenshtein : x = ATG et y = ACTG."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b9iovhyZ2bXw" + }, + "source": [ + "```markdown\n", + "x = ATG\n", + "y = ACTG\n", + "\n", + "S = \n", + "| | 0 | A | C | T | G |\n", + "| :---: | :---: | :---: | :---: | :---: | :---: |\n", + "| **0** | 0 | 1 | 2 | 3 | 4 |\n", + "| **A** | 1 | 0 | 1 | 2 | 3 |\n", + "| **T** | 2 | 1 | 1 | 1 | 2 |\n", + "| **G** | 3 | 2 | 2 | 2 | 1 |\n", + "\n", + "\n", + "B = 0 A C T G\n", + "0 [ 0 < < < < ]\n", + "A | ^ \\ < < < |\n", + "T | ^ ^ \\ \\ < |\n", + "G [ ^ ^ \\ \\ \\ ]\n", + "\n", + "Donc\n", + "d_L (x,y) = 1\n", + "\n", + "Alignement (retour depuis la case G,G) :\n", + "x = A-TG\n", + "y = ACTG\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OV_YaQHr2elB" + }, + "source": [ + "Q3.\tAlignez les séquences suivantes avec l'algorithme de Needleman-Wunsch global x = TAT et y = ATGAC en considérant le schéma d'évaluation suivant\n", + "\n", + "| | A | T | G | C |\n", + "| :---: | :---: | :---: | :---: | :---: |\n", + "| **A** | 1 | -0.5 | -0.5 | -0.5 |\n", + "| **T** | -0.5 | 1 | -0.5 | -0.5 |\n", + "| **G** | -0.5 | -0.5 | 1 | -0.5 |\n", + "| **C** | -0.5 | -0.5 | -0.5 | 1 |\n", + "\n", + "et\n", + "\n", + "$\\gamma(g) = 0.5 |g|$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_MrecVs3Nrw" + }, + "source": [ + "```markdown\n", + "On a:\n", + "s(i,0) = -gamma(i)\n", + "s(0,j) = -gamma(j)\n", + "s(i,j) = max {\n", + "    sig_x[i-1],y[j-1] + s(i-1,j-1)\n", + "    -gamma(1) + s(i-1,j)\n", + "    -gamma(1) + s(i,j-1)\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y1YF-G6E3Qoo" + }, + "source": [ + "Q4. 
Alignez les séquences suivantes avec l'algorithme de Smith-Waterman : x = TTGG et y = ATGAC, en utilisant le schéma d'évaluation de la question précédente.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLMECocb3pgI" + }, + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "46gw0avh3wGw" + }, + "source": [ + "# Exercice 1 : Algorithme de Levenshtein - version récursive" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZKc09Kyg4a6v" + }, + "source": [ + "Q1. Ecrivez une fonction\n", + "\n", + "levenshtein(x: str, y: str) -> int\n", + "\n", + "qui retourne la distance de Levenshtein entre les séquences x et y en utilisant la version récursive de l'algorithme." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "FJR69IEQ4aHv" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "def levenshtein_r(x, y):\n", + "    \"\"\" Recursively calculates the Levenshtein distance between two sequences x and y (str).\"\"\"\n", + "    # Memoize sub-results: the plain recursion takes exponential time.\n", + "    memo = {}\n", + "    # The recursion can go len(x) + len(y) calls deep.\n", + "    sys.setrecursionlimit(max(sys.getrecursionlimit(), len(x) + len(y) + 500))\n", + "\n", + "    def dist(i, j):\n", + "        \"\"\" Distance between the suffixes x[i:] and y[j:].\"\"\"\n", + "        if i == len(x):\n", + "            return len(y) - j\n", + "        if j == len(y):\n", + "            return len(x) - i\n", + "        if (i, j) not in memo:\n", + "            syn_skip = dist(i + 1, j + 1)\n", + "            if x[i] == y[j]:\n", + "                memo[i, j] = syn_skip\n", + "            else:\n", + "                l_skip = dist(i + 1, j)\n", + "                r_skip = dist(i, j + 1)\n", + "                memo[i, j] = 1 + min(syn_skip, l_skip, r_skip)\n", + "        return memo[i, j]\n", + "\n", + "    return dist(0, 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "arFVwA6E5NWn" + }, + "source": [ + "Q2. Vous pouvez tester votre code sur les exemples suivants :\n", + "\n", + "\n", + "* $L('CCAG', 'CA') = 2$\n", + "* $L('CCGT', 'CGTCA') = 3$\n", + "* $L(AY678264^*, OQ870305^*) = 310$\n", + "\n", + "$^*$ identifiants GenBank de deux séquences."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "L('CCAG', 'CA') = 2\n", + "L('CCGT', 'CGTCA') = 3\n" + ] } + ], + "source": [ + "import requests\n", + "\n", + "# Single quotes inside the braces keep these f-strings valid before Python 3.12.\n", + "print(f\"L('CCAG', 'CA') = {levenshtein_r('CCAG', 'CA')}\")\n", + "print(f\"L('CCGT', 'CGTCA') = {levenshtein_r('CCGT', 'CGTCA')}\")\n", + "\n", + "genbank_req = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&id=\"\n", + "genbank_id1 = \"AY678264\"\n", + "genbank_id2 = \"OQ870305\"\n", + "report_suf = \"&report=fasta\"\n", + "\n", + "def fetch_sequence(genbank_id):\n", + "    \"\"\" Downloads a FASTA record and returns its sequence without the header line or line breaks.\"\"\"\n", + "    response = requests.get(f\"{genbank_req}{genbank_id}{report_suf}\", timeout=30)\n", + "    response.raise_for_status()\n", + "    return \"\".join(response.text.splitlines()[1:])\n", + "\n", + "try:\n", + "    id1_genome = fetch_sequence(genbank_id1)\n", + "    id2_genome = fetch_sequence(genbank_id2)\n", + "except requests.RequestException as err:\n", + "    print(f\"Failed to get genomes: {err}\")\n", + "else:\n", + "    # Both sequences are several hundred nucleotides long, so this call is only practical if levenshtein_r memoizes its subproblems.\n", + "    print(f\"L({genbank_id1}, {genbank_id2}) = {levenshtein_r(id1_genome, id2_genome)}\")" + ] }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# BINF TP3 - Algorithmes d'alignement par paire" - ], - "metadata": { - "id": "V09wQ1WIOmgn" - } - }, - { - "cell_type": "markdown", - "source": [ - "Dans ce TP nous allons manipuler les algorithmes d'alignement par paire." - ], - "metadata": { - "id": "er6CtAyOxC6F" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Exercice 0 - Echauffement" - ], - "metadata": { - "id": "BqEa3BJ1xICM" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q1. 
Donnez le score de la superposition :\n", - "\n", - "| | |\n", - "| :---: | :---: |\n", - "x | ATGTCATGA---TAC |\n", - "y | AT--CTAAATGTTAC |\n", - "\n", - "\n", - "étant donne le schéma d'évaluation :\n", - "\n", - "| | A | T | G | C |\n", - "| :---: | :---: | :---: | :---: | :---: |\n", - "| **A** | 1 | -1 | -1 | -1 |\n", - "| **T** | -1 | 1 | -1 | -1 |\n", - "| **G** | -1 | -1 | 1 | -1 |\n", - "| **C** | -1 | -1 | -1 | 1 |\n", - "\n", - "et\n", - "\n", - "$\\gamma(g) = 0.5 |g| + 0.5$" - ], - "metadata": { - "id": "qqiiq5bcxYvM" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Votre réponse ici\n", - "```" - ], - "metadata": { - "id": "kCJGGGYQ2GNi" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q2. Alignez les séquences suivantes avec l'algorithme de Levenshtein : x = ATG et y = ACTG." - ], - "metadata": { - "id": "XyhXAhK-2NKJ" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Votre réponse ici\n", - "```" - ], - "metadata": { - "id": "b9iovhyZ2bXw" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q3.\tAlignez les séquences suivantes avec l'algorithme de Needleman-Wunsch global x = TAT et y = ATGAC en considérant le schéma d'évaluation suivant\n", - "\n", - "| | A | T | G | C |\n", - "| :---: | :---: | :---: | :---: | :---: |\n", - "| **A** | 1 | -0.5 | -0.5 | -0.5 |\n", - "| **T** | -0.5 | 1 | -0.5 | -0.5 |\n", - "| **G** | -0.5 | -0.5 | 1 | -0.5 |\n", - "| **C** | -0.5 | -0.5 | -0.5 | 1 |\n", - "\n", - "et\n", - "\n", - "$\\gamma(g) = 0.5 |g|$\n" - ], - "metadata": { - "id": "OV_YaQHr2elB" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Votre réponse ici\n", - "```" - ], - "metadata": { - "id": "g_MrecVs3Nrw" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q4. 
Alignez les séquences suivantes avec l'algorithme de Smith-Waterman x = TTGG y = ATGAC en utilisant le schéma d'évaluation de la question précédente.\n" - ], - "metadata": { - "id": "y1YF-G6E3Qoo" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Votre réponse ici\n", - "```" - ], - "metadata": { - "id": "LLMECocb3pgI" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Exercice 1 : Algorithme de Levenshtein - version récursive" - ], - "metadata": { - "id": "46gw0avh3wGw" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q1. Ecrivez une fonction\n", - "\n", - "levenshtein(x: str, y: str) -> int\n", - "\n", - "qui retourne la distance de Levenshtein entre les séquences x et y en utilisant la version récursive de l'algorithme." - ], - "metadata": { - "id": "ZKc09Kyg4a6v" - } - }, - { - "cell_type": "code", - "source": [ - "#Votre code ici" - ], - "metadata": { - "id": "FJR69IEQ4aHv" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Q2. Vous pouvez tester votre code sur les exemples suivants:\n", - "\n", - "\n", - "* $L('CCAG', 'CA') = 2$\n", - "* $L('CCGT', 'CGTCA') = 3$\n", - "* $L(AY678264^*, OQ870305^*) = 310$\n", - "\n", - "$^*$ ids genbank de deux sequences." - ], - "metadata": { - "id": "arFVwA6E5NWn" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Exercice 2 : Algorithme de Smith-Waterman - version itérative" - ], - "metadata": { - "id": "erCpfG7O7BV-" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q1. Ecrivez la fonction\n", - "\n", - "sw_fwd(x: str, y: str, cmap: dict, sigma: array, (go, ge): list) -> (array, array)\n", - "\n", - "qui construit les matrices $S$ et $B$ en utilisant l'algorithme de Smith-Waterman pour aligner les séquences x et y suivant le schéma d'évaluation donné par la matrice de substitution $\\Sigma$ et la fonction d'évaluation des trous $\\gamma(n)= g_o + g_e \\times n$. 
Le dictionnaire cmap donne la position des différents nucléotides dans la matrice $\\Sigma$. La fonction retourne la paire de matrices de score $S$ et de retour $B$." - ], - "metadata": { - "id": "rv2Y78y37IOd" - } - }, - { - "cell_type": "code", - "source": [ - "#Votre code ici" - ], - "metadata": { - "id": "njn3JB0b-WHj" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Q2. Ecrivez la fonction\n", - "\n", - "sw_bwd(x: str, y: str, S: array, B: array) -> (str, str, float)\n", - "\n", - "qui effectue l'etape de retour de l'algorithme de Smith-Waterman etant donné les séquences $x$ et $y$ et les matrices de score $S$ et de retour $B$. La fonction retourne un tuple contenant les alignements des séquences x et y et le score de l'alignement." - ], - "metadata": { - "id": "55n8mt9U-Wai" - } - }, - { - "cell_type": "code", - "source": [ - "#Votre code ici" - ], - "metadata": { - "id": "ij9JDpBm_UZ7" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Q3. Vous pouvez tester votre code en utilisant le schéma d'évaluation suivant :" - ], - "metadata": { - "id": "kwmxg2dxAiwS" - } - }, - { - "cell_type": "code", - "source": [ - "cmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", - "m = np.array([[1, -0.5, -0.5, -0.5],\n", - " [-0.5, 1, -0.5, -0.5],\n", - " [-0.5, -0.5, 1, -0.5],\n", - " [-0.5, -0.5, -0.5, 1]])\n", - "go = 0\n", - "ge = 0.5" - ], - "metadata": { - "id": "JUtYRFTBAwwZ" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "* $SW('TCGC', 'CTTAG')$ retourne un score de $1.5$ à la position $(3,5)$ et l'alignement" - ], - "metadata": { - "id": "eMGh4K5aIFxE" - } + { + "cell_type": "markdown", + "metadata": { + "id": "erCpfG7O7BV-" + }, + "source": [ + "# Exercice 2 : Algorithme de Smith-Waterman - version itérative" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rv2Y78y37IOd" + }, + "source": [ + "Q1. 
Ecrivez la fonction\n", + "\n", + "sw_fwd(x: str, y: str, cmap: dict, sigma: array, (go, ge): list) -> (array, array)\n", + "\n", + "qui construit les matrices $S$ et $B$ en utilisant l'algorithme de Smith-Waterman pour aligner les séquences x et y suivant le schéma d'évaluation donné par la matrice de substitution $\\Sigma$ et la fonction d'évaluation des trous $\\gamma(n)= g_o + g_e \\times n$. Le dictionnaire cmap donne la position des différents nucléotides dans la matrice $\\Sigma$. La fonction retourne la paire de matrices de score $S$ et de retour $B$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "njn3JB0b-WHj" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import numpy.typing as npt\n", + "\n", + "def sw_fwd(x: str, y: str, cmap: dict, sigma: npt.NDArray[np.float64], gZ: float, gE: float) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.int8]]:\n", + "    \"\"\" Iteratively fills the Smith-Waterman matrices (forward step).\n", + "\n", + "    Returns a tuple of NumPy arrays:\n", + "    - S: Score Matrix\n", + "    - B: Return Matrix (assumed encoding: 0 = stop, 1 = diagonal, 2 = up, 3 = left)\n", + "    \"\"\"\n", + "\n", + "    def gamma(n) -> float:\n", + "        return gZ + n * gE\n", + "\n", + "    # Step 1: Initialise the matrices\n", + "    # S and B are of shape (n + 1, m + 1) with n the length of x and m the length of y\n", + "    S = np.zeros((len(x) + 1, len(y) + 1))\n", + "    B = np.zeros((len(x) + 1, len(y) + 1), dtype=np.int8)\n", + "\n", + "    # Step 2: Fill the matrices row by row.\n", + "    # Assumption: every gap position costs gamma(1); this equals gamma(n) for a whole gap only when gZ == 0.\n", + "    for i in range(1, len(x) + 1):\n", + "        for j in range(1, len(y) + 1):\n", + "            candidates = [\n", + "                0.0,\n", + "                S[i - 1, j - 1] + sigma[cmap[x[i - 1]], cmap[y[j - 1]]],\n", + "                S[i - 1, j] - gamma(1),\n", + "                S[i, j - 1] - gamma(1),\n", + "            ]\n", + "            k = int(np.argmax(candidates))  # first maximum wins: stop, then diagonal, up, left\n", + "            B[i, j] = k\n", + "            S[i, j] = candidates[k]\n", + "\n", + "    return S, B" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "55n8mt9U-Wai" + }, + "source": [ + "Q2. Ecrivez la fonction\n", + "\n", + "sw_bwd(x: str, y: str, S: array, B: array) -> (str, str, float)\n", + "\n", + "qui effectue l'étape de retour de l'algorithme de Smith-Waterman étant donné les séquences $x$ et $y$ et les matrices de score $S$ et de retour $B$. La fonction retourne un tuple contenant les alignements des séquences x et y et le score de l'alignement."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ij9JDpBm_UZ7" + }, + "outputs": [], + "source": [ + "#Votre code ici" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kwmxg2dxAiwS" + }, + "source": [ + "Q3. Vous pouvez tester votre code en utilisant le schéma d'évaluation suivant :" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JUtYRFTBAwwZ" + }, + "outputs": [], + "source": [ + "cmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", + "m = np.array([[1, -0.5, -0.5, -0.5],\n", + " [-0.5, 1, -0.5, -0.5],\n", + " [-0.5, -0.5, 1, -0.5],\n", + " [-0.5, -0.5, -0.5, 1]])\n", + "go = 0\n", + "ge = 0.5" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eMGh4K5aIFxE" + }, + "source": [ + "* $SW('TCGC', 'CTTAG')$ retourne un score de $1.5$ à la position $(3,5)$ et l'alignement" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 60 }, + "id": "joHNwJ9AIf6F", + "outputId": "a9206810-a083-4d86-8b14-38183f1dd80c" + }, + "outputs": [ { - "cell_type": "code", - "source": [ - "HTML(\"
x:TCG
y:TAG
\")" + "data": { + "text/html": [ + "
x:TCG
y:TAG
" ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 60 - }, - "id": "joHNwJ9AIf6F", - "outputId": "a9206810-a083-4d86-8b14-38183f1dd80c" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
x:TCG
y:TAG
" - ] - }, - "metadata": {}, - "execution_count": 18 - } + "text/plain": [ + "" ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "HTML(\"
x:TCG
y:TAG
\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JJlU5yvZI43D" + }, + "source": [ + "* $SW(AY678264^*, OQ870305^*)$ retourne un score de $342.1$ à la position $(708,717)$ et l'alignement" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 80 }, + "id": "HUELvWKMFtIO", + "outputId": "976bab6f-f1fc-4c5a-c69c-8de02fc838d0" + }, + "outputs": [ { - "cell_type": "markdown", - "source": [ - "* $SW(AY678264^*, OQ870305^*)$ retourne un score de $342.1$ à la position $(708,717)$ et l'alignement" - ], - "metadata": { - "id": "JJlU5yvZI43D" - } - }, - { - "cell_type": "code", - "source": [ - "from IPython.display import HTML\n", - "HTML(\"
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
\")" + "data": { + "text/html": [ + "
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
" ], - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 80 - }, - "id": "HUELvWKMFtIO", - "outputId": "976bab6f-f1fc-4c5a-c69c-8de02fc838d0" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "" - ], - "text/html": [ - "
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
" - ] - }, - "metadata": {}, - "execution_count": 15 - } + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "source": [ - "# Exercise 3: Distribution of alignment scores for random sequences\n", - "\n", - "To test whether an alignment reflects genuine biological similarity, we will evaluate the distribution of alignment scores for pairs of random sequences." - ], - "metadata": { - "id": "Q5jVeLfgMMtA" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q1. Consider two random sequences of the same length N, in which each nucleotide appears with uniform probability ¼. Compute the expected mean score of a gapless superposition when a match scores +1 and a mismatch scores 0." - ], - "metadata": { - "id": "6xyXw0HsMQGf" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Your answer here\n", - "```" - ], - "metadata": { - "id": "meF18gt-Mhcn" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q2. The previous question can be solved analytically because gaps are not considered. To extend that result to an alignment with gaps, we will rely on simulating random sequences.\n", - "\n", - "Generate $R$ pairs of random sequences of length $N$ with uniform nucleotide probabilities $p_A = p_T = p_G = p_C = $ ¼. Display as violin plots the distributions of the alignment scores for each pair, obtained by:\n", - " 1. a gapless alignment (cf. Q1);\n", - " 2. a local alignment via Smith-Waterman (use the code from the previous exercise)\n", - "\n", - "Use the following scoring scheme:" - ], - "metadata": { - "id": "fP5_mHnYMkNI" - } - }, - { - "cell_type": "code", - "source": [ - "rmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", - "sigma = np.array([[1, -0.5, -0.5, -0.5],\n", - " [-0.5, 1, -0.5, -0.5],\n", - " [-0.5, -0.5, 1, -0.5],\n", - " [-0.5, -0.5, -0.5, 1]])\n", - "go = 0\n", - "ge = 0.5" - ], - "metadata": { - "id": "akUVqotnOLkH" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Your code here" - ], - "metadata": { - "id": "UX0afNaqOVZ2" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Q3. What do you observe?" - ], - "metadata": { - "id": "UNn9fUuXO4Le" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Your answer here\n", - "```" - ], - "metadata": { - "id": "dSQEl0XXO8IG" - } - }, - { - "cell_type": "markdown", - "source": [ - "Q4. What can you conclude about the significance of an alignment?" - ], - "metadata": { - "id": "xHfVXpQhf15n" - } - }, - { - "cell_type": "markdown", - "source": [ - "```markdown\n", - "Your answer here\n", - "```" - ], - "metadata": { - "id": "5KjhEeHDgDns" - } + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" } - ] -} \ No newline at end of file + ], + "source": [ + "from IPython.display import HTML\n", + "HTML(\"
x:ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAGTTCATGCGCTTCAAGGTGC-A-CATGGAGGGCTCCGTGAACGGCCACGAGTTCGAGATCGAG---GGCGAGGGCGAGGGC--CGCC-CCTACGAGGGCACCCAGACCGC-CAAGCTGAAGGTG-ACCA-AGG---G-TGGCC---CCCT-GCCCTTCGCCT-GGGA-CATCCTGTCC--C--C-T-CAGTTCATGT-A-CGGCT-CCAAGGCCTACGTG-A--AGCAC--C--C--C--G-CCGACATCCCCG-A--CTAC-T--TGAAGCTG-TCCTTC--C--C-----CGA-GG--GCTTCAAGTGGGAGCG-CGTGATGAACTTCGAGGACGGCGGCGTGGTG-ACCG--T-GA-C-CCAGGAC-TC--CTCCCTGCAGGACGGCGAGTTCATCTACAAGGTG---AAGCTGCGCGGCACCAACTTCCCCT-CCGACGGCCCCGTA-ATGCA-GAAGAAGACCATGGGCTG--GGA-GGCCTCCTCCGAGCGGATGTACCCCGAGGA-CGGCGCC-CTGAAGGGCGAGATCAAGCAGA-GGCTGAAGC-TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACA-AGGCCAAGAAG-CCCGTGCAGCTGCCCGGC-GCCTACAACGTCAACATCAAGT-TG----GA-CATCACCTCCCACAACGAGGA-CTAC-A-C-CA---T-C-G-TGGAACAGTACG-AACGCGCCGAGGGCCGCCACTCCAC-CGGCGGCATGGACGAGCTGTACAAG
y:ATGGTGAGCAAGGGCGAGGA-G----C-T-G--TTCA-C-CGG-GGTGGTGCCCATCCTGGT-CGAGC-TGGACGGCGACGTAAACGGCCACAAGTTC-AG--CGTGTCCGGCGAGGGCGAGGGCGATGCCACCTAC---GGCAAGCTGACC-CTGAAG-TTCATTTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCC-AC-CCTCGTGACCACCCTGACCTACGGCGTGCAGTGC-T-TCAGCCGCTACCCCGACC-ACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGC-GCACCATCTTCTTCAAGGACGACGGCAACTACAAGA-CCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGC-A--ACATC--C-TGGGGCACAAGCTG-G-AGTA-CAACTACAACAGCC-ACAACGTC-TATAT-CATG--GCCGA-CAA--GCAGAAGAACGG-CA--T-C-A-AGG-TGAACTTC-AAGATC--CGCCAC--AA---C---ATCGAG--GACGGC---AGCGTGCAGCTCGCCGACCACTACCA-GC--A-G--AACACC-CC--CATCGGCGACG--GCCCCGTGCTGCTGCCCGACAACC-ACTACCTGAGCACCCAGTCCGCCCTGAGCAA-A-GACCC-CAACGAGAAGC-GCGATCACATGGTCCTGCTGG---AGTTCGTGAC-CGCC----GCCGGGA-T-CACTC-TCGGCATGGACGAGCTGTACAAG
\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q5jVeLfgMMtA" + }, + "source": [ + "# Exercice 3 : Distribution des scores d’alignement pour des séquences aléatoires\n", + "\n", + "Pour tester si un alignement reflète une réelle similarité biologique, on va évaluer la distribution des scores d’alignement pour des paires de séquences aléatoires." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6xyXw0HsMQGf" + }, + "source": [ + "Q1. En considérant deux séquences aléatoires de même taille N, où chaque nucléotide apparaît avec une probabilité uniforme de ¼, calculer le score moyen attendu pour une superposition sans trou dans le cas où une identité vaut +1 et une différence vaut 0." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "meF18gt-Mhcn" + }, + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fP5_mHnYMkNI" + }, + "source": [ + "Q2. La question précédente peut se resoudre analytiquement car on ne considère pas de trou. Pour étendre le résultat precedent à un alignement avec trous, on va se baser sur la simulation de séquences aleatoires.\n", + "\n", + "Générez $R$ paires de séquences aléatoires de tailles $N$ avec des probabilitées uniformes d'apparition de nucléotides $p_A = p_T = p_G = p_C = $ ¼. Affichez sous forme de violinplots les distribution des scores d'alignements entre chaque paire, obtenu par :\n", + " 1. un alignement sans trou (cf. Q1) ;\n", + " 2. 
un alignement local via Smith-Waterman (utilisez le code de l'exercice précédent)\n", + "\n", + "Utilisez le schéma d'évaluation suivant :" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "akUVqotnOLkH" + }, + "outputs": [], + "source": [ + "rmap = {\"A\": 0, \"T\": 1, \"G\": 2, \"C\": 3}\n", + "sigma = np.array([[1, -0.5, -0.5, -0.5],\n", + " [-0.5, 1, -0.5, -0.5],\n", + " [-0.5, -0.5, 1, -0.5],\n", + " [-0.5, -0.5, -0.5, 1]])\n", + "go =0\n", + "ge = 0.5" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UX0afNaqOVZ2" + }, + "outputs": [], + "source": [ + "#Votre code ici" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UNn9fUuXO4Le" + }, + "source": [ + "Q3. Qu'observez-vous ?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dSQEl0XXO8IG" + }, + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xHfVXpQhf15n" + }, + "source": [ + "Q4. Quelle conclusion peut-on en tirer sur la significativité d'un alignement ?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5KjhEeHDgDns" + }, + "source": [ + "```markdown\n", + "Votre réponse ici\n", + "```" + ] + } + ], + "metadata": { + "colab": { + "authorship_tag": "ABX9TyNSXnqaXAUgZK9rmJ1TWbGo", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}