Note: There are 2 paths through these tutorials, one with the FIU HPCC and one without.
Session 3: Working with Files
Last time we learned how to navigate a directory structure using shell commands. This session will focus on using shell commands to manipulate files.
Getting the file
You have 3 options for this tutorial:
1) the Terminal Playground at sandbox.bio; 2) the Terminal application on your Mac/Linux OS; 3) PowerShell on your PC/Windows OS.
We are analyzing a file of DNA sequence from a recent Illumina sequencing run (this is only part of the file).
@SN747:483:C77GWACXX:1:1102:2015:10148 1:N:0:AGGCAGAA+ACNNNATA TGCAGGAAGGCGTTTCGTAAGAATGAAAAGAGTTGCTCCACAATGTGTCACGTCCAAATTGAAGAAGAACTGAAACACATACAAGTACATTATTTTGTAAT + @@CFDFDFHHGHHJIHHIIJJGIIJIJJJIIJGGIIJJJJJIIIIGIJJJJJHIJJIJIGHGHHHHHHFFFFFEC@CCBCCCCDD@ACDFECEEEEDDEDA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNNATA TGCAGGGGGTGAGAATGACGAAAGGAAAGAAGGGAAAGATCGCCCGACCGAATGGACAGTACAAGCTTGTCGATATAGTTCCTGTCTCTTATACACATCTC + ??@D1BDAD@D?FG?EHIGIIGGBHIHIIEGC=F>CB>@CCCCB@CC8@B?BB0>>:@>>@>C>CCCC:CCC@CC@@CA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TNCAGGTCCTTGTGCTCCTTGATATCCTTCTCTTCCTGGTCGCCCAACTGGACCACGCGGATTGATATACCTATTTGTATTCTGTCCGGGCGGTCCACTTG + @#14ADBDFDDADECGB@HGHEHEHGG@CGECHHG<31:DBGG)0?3BD;CGGHIIID<;@CCAC@C:AA5;>@CDDDDECC?099BB5>B>(+:( @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:TGGCAGAA+ACNNCATA GTGGATCCTGTGCTGCCTCGCCTTCGCCGGCGCCGCCCACGGTTGCGGCTCTCCCGCCATCTCGCCCGTCATCACCGGTTACTCTCGTATCGTGAACGGCG + 1:1+=2A=:DDFBB@+2+33@188<C))0)0-'-54'',59?,52(,37-0>(+37-5-0((:8?;;;;B?<::>9<@<-&:@ABA@BBBB0<@A?((5>5 @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATA ATGTATTCCGGGATGACTAGGTGTGGGGGGAGGGGCCGGCGGGTGGTGGCCTCCCGACTTGTCACTCGTCAGTATCGTGTTGCCTTGTTTCGTGCCCGGGG + (0022)))))(((2())))2))).((((()&&)&&&&&&&&)&)0&&(+(((+(()&&)+++((++(++(()((+((+((+((((+++(+2(((((+)&&) @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGTGACCCAAATATGCAAATGCGGTGTAGTTGTGGAACGTCGCAGTTGCGAAGTGTCGCAGTTGCGAAATGTCGCGAAGCTCTCCACTTCCAAGATG + :;?AAD=BFHHFHHIIGHJIJIDHIJJJGHHGJJJCGHGIJJJIJIJIIGIHEHDEF;?@C?BDB;;>ABDD=?CDDB?@>BDBCCDDDDDDDDD@CCDDA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA GGATGACCTTGTAAGGGTTGGTATCTTCCACAGCGCTTCTTGTCATCTAAAACGGAAACCAAATGACTCAAATTCTCTGAAAGTGACCATTTGTCTCGCAG + :1=442AB==C?C?,+A:?:):<<AFFI?F:?F6@49?DDFF<FFBB<B:55@E9;6A6==?7;;@37;(66;6>@>3>>;;>@:>5,5,:5>AB50 @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGAGAGAGAGATGCCATCGGCTCCTCTTCTCTCTATCTTCTCTTTTGAATTTTTGCAAAATGTGGAATGATATCCAATCTCCCTGTCTCTTATACAC + 
@@@A;A?BC?D<DB;EHIGCH@;DDH))93:B>??D<<<DFCHEEBD?FHBC88)8=F'=@C:@=.?>CCCC;7777.6;CC3>;(;5(>CCCCC<@35;> @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATC TGCAGGAAACCGAGCGTCCAAAAATTCATCGCTTCTCATCCACAGCTTCACAACTGTCTCTTATACACATCTCCGAGCCCACGAGACAGGCAGAAATCTCG + 813=58((:=>((8>/<5=;>??(.3<3169(373>)))<>1)..;1)=>;??6)8=8>?@8=8=987((..6=93,'855;)&)05(+2(266(55:2:( @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGTCACTACAGGAGGTCTCTGAAAAAAGAAATAACAATTAATTGTTTCTAGCTTTTTCCAAAGTAAATTGAAATTTGTCATGCTCTAAATAATATCT + [jfierst@u05 willis_10-8-15_demux]$ head -100 Undetermined_S0_L001_R1_001.fastq.AGGCAGAA+ACTGCATA @SN747:483:C77GWACXX:1:1102:2015:10148 1:N:0:AGGCAGAA+ACNNNATA TGCAGGAAGGCGTTTCGTAAGAATGAAAAGAGTTGCTCCACAATGTGTCACGTCCAAATTGAAGAAGAACTGAAACACATACAAGTACATTATTTTGTAAT + @@CFDFDFHHGHHJIHHIIJJGIIJIJJJIIJGGIIJJJJJIIIIGIJJJJJHIJJIJIGHGHHHHHHFFFFFEC@CCBCCCCDD@ACDFECEEEEDDEDA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNNATA TGCAGGGGGTGAGAATGACGAAAGGAAAGAAGGGAAAGATCGCCCGACCGAATGGACAGTACAAGCTTGTCGATATAGTTCCTGTCTCTTATACACATCTC + ??@D1BDAD@D?FG?EHIGIIGGBHIHIIEGC=F>CB>@CCCCB@CC8@B?BB0>>:@>>@>C>CCCC:CCC@CC@@CA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TNCAGGTCCTTGTGCTCCTTGATATCCTTCTCTTCCTGGTCGCCCAACTGGACCACGCGGATTGATATACCTATTTGTATTCTGTCCGGGCGGTCCACTTG + @#14ADBDFDDADECGB@HGHEHEHGG@CGECHHG<31:DBGG)0?3BD;CGGHIIID<;@CCAC@C:AA5;>@CDDDDECC?099BB5>B>(+:( @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:TGGCAGAA+ACNNCATA GTGGATCCTGTGCTGCCTCGCCTTCGCCGGCGCCGCCCACGGTTGCGGCTCTCCCGCCATCTCGCCCGTCATCACCGGTTACTCTCGTATCGTGAACGGCG + 1:1+=2A=:DDFBB@+2+33@188<C))0)0-'-54'',59?,52(,37-0>(+37-5-0((:8?;;;;B?<::>9<@<-&:@ABA@BBBB0<@A?((5>5 @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATA ATGTATTCCGGGATGACTAGGTGTGGGGGGAGGGGCCGGCGGGTGGTGGCCTCCCGACTTGTCACTCGTCAGTATCGTGTTGCCTTGTTTCGTGCCCGGGG + (0022)))))(((2())))2))).((((()&&)&&&&&&&&)&)0&&(+(((+(()&&)+++((++(++(()((+((+((+((((+++(+2(((((+)&&) @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA 
TGCAGGTGACCCAAATATGCAAATGCGGTGTAGTTGTGGAACGTCGCAGTTGCGAAGTGTCGCAGTTGCGAAATGTCGCGAAGCTCTCCACTTCCAAGATG + :;?AAD=BFHHFHHIIGHJIJIDHIJJJGHHGJJJCGHGIJJJIJIJIIGIHEHDEF;?@C?BDB;;>ABDD=?CDDB?@>BDBCCDDDDDDDDD@CCDDA @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA GGATGACCTTGTAAGGGTTGGTATCTTCCACAGCGCTTCTTGTCATCTAAAACGGAAACCAAATGACTCAAATTCTCTGAAAGTGACCATTTGTCTCGCAG + :1=442AB==C?C?,+A:?:):<<AFFI?F:?F6@49?DDFF<FFBB<B:55@E9;6A6==?7;;@37;(66;6>@>3>>;;>@:>5,5,:5>AB50 @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGAGAGAGAGATGCCATCGGCTCCTCTTCTCTCTATCTTCTCTTTTGAATTTTTGCAAAATGTGGAATGATATCCAATCTCCCTGTCTCTTATACAC + @@@A;A?BC?D<DB;EHIGCH@;DDH))93:B>??D<<<DFCHEEBD?FHBC88)8=F'=@C:@=.?>CCCC;7777.6;CC3>;(;5(>CCCCC<@35;> @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATC TGCAGGAAACCGAGCGTCCAAAAATTCATCGCTTCTCATCCACAGCTTCACAACTGTCTCTTATACACATCTCCGAGCCCACGAGACAGGCAGAAATCTCG + 813=58((:=>((8>/<5=;>??(.3<3169(373>)))<>1)..;1)=>;??6)8=8>?@8=8=987((..6=93,'855;)&)05(+2(266(55:2:( @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGTCACTACAGGAGGTCTCTGAAAAAAGAAATAACAATTAATTGTTTCTAGCTTTTTCCAAAGTAAATTGAAATTTGTCATGCTCTAAATAATATCT + @@CFFFDFFHHFHJJJJJJCGIJJJJJIJEGIIIJIIJIJJIJJGIIJIJJJJGJHIJJJIJIIJJIHHHHGHHGBEFFFFFEDEEEEEDDCCACCDDEDC @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGCACAAATGCTTGCCACTCTTGATGTCATGATGGGTGGACGAGTAGCTGAAGAACTCATTTTTGGAGAAGACAAAGTAACAACTGGAGCAGCCGAT +
'7;ACC>ACCC?;@CDCCBCDCDBCBDBDDDB @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATA GATCGGAAGAGCACACGTCTACATACAAATACTCGGCCAGACTCAGCAATGCCTTCATTACTTGACAATTTATTGTTGTCCTTTTAATGTTAATCAAAATC + ;==+A<26(..5-;-(5(-55(,((((,:( @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCAGGCCGAATAAAGCAAGCCATCCGGATGCAGAAAAAAAGATTGGCTGATGATGCAGAATATAAAGAGCAGGAACAGTTCAGGGTCCGATGGGAAACCA + CCCFFFFFHDDHHIIJIIIHIIIJJJIJJIIHHIGHJIJJIIDHGGGHHHDEFFFFFECEEECDCEECCCDDDDDEHEHEFFFDDCDCEDCDDDDDDEDDACA:>BBCCCCCDDCD?DDB?@CCA9BB& @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATC GTCGTCGTCTTTCCATGTGCTTGTTGACTGTGCTTCGGTGAGTCGTGGCAGGTACCTTGGGAGCCTTCCGGATTGGTGAAGTCCACAGGGGGGGGGACGGG + 11122<7?CA=A2<,<,**0/7;=)(5()/.77).((-',)).(,'','((,(,,,5(,,(,297&00&5)&))&) @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATA CCCGACCGCTAGCTCGACACCGTGGAGGACTCAAGTGGGCATCTAGGGCGTAAGGAGGCGGGGACGTGGGGGTCGCTTGCAAGCAGCTGGAGCCGCGTGAA + 1++++))))))0+++)))))0))))((((())))))/)(().))...(.'''(,((('('&&&))&&((&&)&0)5&(+(((((23(((((++(&&)&)(+ @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATGTACGTGCGTGATAGGGGGGTCTAT + @<@FDFFFGHGHHHEHGHIIIGIGHEHIJIA>>66;>='20&&+(((+((((&2)(+(+((&&))&+(( @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNTATA GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCTGCTTGGACAAAACTAGACAGTTTGTATGAGGCATGCCTAAC + =+1AAAA<=AA*:?B>4*?*:?A80)=>?<=7=7=>==:AA1=.)73.)-))).)6(;((.-((;3;?9(,(,(((((,(+(+(+ @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNNATA TGCAGGCGGCGCAGTCCATGCCGCTGACTTTCCAGCTATAGCGGGAGCCGGAGACGTTTTCAGAGAGCGTTGGCGAGCTGGAACATGCGCCGTCGCAGCAA + ?@?+4BDDHDHGHJGIIJIJJGJJJJJJJJJJGHHHHHHDFEFDDDDDBDDDDDBB?@CBCCACDDBBDBDBDDDBDBDDDBDD?CCDDDDB0BBBC @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA TGCTGGCTGTCTATTTTGTTTGTAACTGGAATTTTAAGGTCCATAGAACAAAGGAAAATGGGCGGTCTGTATTGAACATTATAGTCAAAACAATGTGAAAA + 
?@@+==DDDDFDFHHIJGGGHFCDHEEGHGGEHIJGIHJBGEGIJJJJJJGHIFGFEGCEHGIIF?CCCCCDDDD>CCDDDEECCDEACDDDBCCACDEDD @SN747:483:C77GWACXX:1:1102:0:0 1:N:0:AGGCAGAA+ACNNCATA CCGTCGCCTCCAAAAGGAAGTGCGCCCGAGACCGGACGAGGCGGCGGCGCTCAAAGAGCTCAGCGTCGGCCCTACGCCCCTCGGCCACACCGCGAAGCCGG + 1114=0)20A=)('-'--73;;/3''',,'59(5',3&0+(+3+88:8=3((+38;83;?:(8:: @SN747:483:C77GWACXX:1:1102:0:0 1:Y:0:AGGCAGAA+ACNNCATA GGCAGCCTTATAACCGCCAACAAAGTCCACACAGCCGATGCCGAACCAGTCCGACCCATCGACTATTTGATCCCACACTCGCTGTCTCTTATACACATCTC + (-7==(()2>11>14?<<(9(@<1)29((.2(:)00<.3<)0))(-5=-'''''3',((,,++((+58(((((((((+((+((+(((+&(02))&&)))&(+(((+(

In order to create a file we will need to paste the sequences above into a text editor. My favorite is vim. It has a lot of functionality, but we will use it in the simplest way to create our file. Type

$ vi file.fastq

Your window should now have a lot of blue tildes in a single column. You are inside your file and you can edit it. Type 'i' for 'insert mode' and you should see --INSERT-- at the bottom of your screen. Copy the sequences above and paste them into your terminal.

Before we can work with the file we need to save it and exit vi. We do this by hitting the escape key to move from insert mode to command mode, then typing the save command and the quit command together. Type

:wq

You should now be back at your prompt. Check that your file exists by typing

$ ls

to list your directory contents. Check the file size with

$ du -h file.fastq

What does "du" stand for? What is the -h? Try

$ man du

and remember to use ‘q’ when you are ready to quit looking at the manual. Look inside the file with

$ more file.fastq

In order to quit "more," press

q

and you should return to the prompt. You can also use *less*, *head*, and *tail*. Play around with these a bit, for example

$ less file.fastq

Hint: look at the bottom of your screen when you try ‘more’ or ‘less’. 'More' and 'less' are both terminal pagers, which means we use them to view things on our terminal.
The key difference is that with less you can go forwards and backwards, while with more you can only go forwards. Then try

$ head file.fastq
$ head -1 file.fastq
$ head -10 file.fastq
$ tail file.fastq
$ tail -2 file.fastq

This is a *fastq* file. It has four lines for each DNA sequencing read: 1) the sequence read identifier; 2) the DNA sequence (or read); 3) "+"; and 4) quality scores for each base. We will go into more depth about .fastq files later, but for now we will use this file to learn more shell commands. We typically use '.fastq' (note the 'dot' at the front) to denote the file type.

Count the lines, words, and characters in the file with 'wc'

$ wc file.fastq

What is the output? Try

$ man wc

to understand the columns in your output. Now try

$ wc -l file.fastq

where the -l flag tells wc (short for "word count") to count lines only. How many lines do you have? How many actual DNA sequences do you have? (Hint: remember that each DNA sequence gets 4 lines!)

Again, wc is a very powerful command, so take some time to understand it. Type

$ man wc

and read through the manual pages. How can you use wc to find the sequence read length (that is, how many characters are in each DNA sequence)? Hint: it is easiest if you combine wc with another command. Double hint: there are always many different ways to do things in Unix and there are many correct answers to this question. For example, my solution would be:

$ head -2 file.fastq | tail -1 | wc -c

What is this doing? Try to go through each command individually and understand the pieces of the operation. For example, start with

$ head -2 file.fastq

then

$ head -2 file.fastq | tail -1

then

$ head -2 file.fastq | tail -1 | wc -c

Note that wc -c also counts the newline character at the end of the line, so the read length is one less than the number reported.

We are going to explore the standard shell commands cp (copy) and mv (move). Both commands take the syntax **operation, from location 1, to location 2**. To explore these, make a new directory

$ mkdir ./that_directory

and move your file to it

$ mv file.fastq ./that_directory

Now type

$ ls

What happens?
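The from/to pattern that mv and cp share can be rehearsed safely before touching real data. The sketch below is only an illustration: it works in a throwaway directory created with mktemp, and the names demo.txt, demo_copy.txt, and subdir are invented for the example.

```shell
# Work in a temporary scratch directory so no real files are moved
# (assumption: mktemp is available on your system).
scratch=$(mktemp -d)
cd "$scratch"

# Create a small made-up file and a made-up destination directory.
printf 'line one\nline two\n' > demo.txt
mkdir subdir

# cp: operation, from location 1, to location 2. The original stays in place.
cp demo.txt demo_copy.txt

# mv: same syntax, but the file leaves location 1.
mv demo.txt subdir/

# List both locations to see where each file ended up.
ls
ls subdir
```

Compare the two listings with what you see for file.fastq and that_directory.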
Navigate to your new directory and try again

$ cd that_directory
$ ls

Did you find file.fastq? You can also create a copy of your file with a different name

$ cp file.fastq file_copy.fastq

or with the same name but in a different directory

$ cp file.fastq ../

What happens?

**Searching within a file**

We can search in a few different ways. Today we will learn about 'grep', which stands for **g**lobally search a **r**egular **e**xpression and **p**rint. In general, 'grep' is a command that will search for a specific pattern. We call these patterns 'regular expressions' because we can use different characters and syntax to specify what we want to search for. We are not going to go too deeply into grep (for that you will need to take Computational Biology next semester!) but we will go through a few fun features here.

Illumina sequence reads receive a 'chastity' mark that tells us about their quality. It is a little confusing because the mark records whether the read failed the filter: 'N' means no, the read did not fail the filter, and 'Y' means yes, it did fail. Our 'Y' sequences are actually our bad ones and our 'N' sequences are our good ones. This filter mark appears in the second part of the read identifier line (for example, 1:Y:0 or 1:N:0) and we can search for it with grep

$ grep ':Y:' file.fastq

Here, grep has searched for this colon-Y-colon pattern in your file.fastq and you should see the pattern colored in red. On ASC grep will color these results by default, but it will not on other systems. If you are doing this on your own computer you can add the flag --color=always to get that same color

$ grep ':Y:' file.fastq --color=always

You can see the colors very well with a longer string. For example, how many reads have a 10-bp adenine homopolymer?
We can explore like this

$ grep 'AAAAAAAAAA' file.fastq

We can also use extended regular expressions (the -E flag below) to find our 10-bp adenine homopolymer like this

$ grep -E '([A]{10})' file.fastq

Learning regular expressions is simply learning how to build up these types of patterns.

We can also use grep to get some information about our file. We can count how many sequence reads did not pass chastity like this

$ grep -c ':Y:' file.fastq

Grep should now be returning a count of chastity-failers. We can also search for lines that do not contain our pattern with -v. For example, every sequence read will get a 'Y' or 'N'. Count your chastity-failers like this

$ grep -c ':Y:' file.fastq

and your chastity-passers like this

$ grep '^@SN747' file.fastq | grep -c -v ':Y:'

Here the first grep keeps only the read identifier lines, which in this file begin with @SN747. This should be equivalent to

$ grep -c ':N:' file.fastq

We can also ask grep to return the lines of our file below and above our desired pattern. This is very useful in a multi-line file format like our .fastq. For example, let's say we want to find the read information for every one of our 10-bp adenine homopolymers. We can ask grep with -A, which delivers lines after our pattern, or -B, which delivers lines before our pattern. For our 10-bp adenine homopolymers it looks like this

$ grep -B 1 -E '([A]{10})' file.fastq

And now you should see the read identifier before each 10-bp homopolymer. Grep will put a '--' line in between each of your matched blocks of text.

**Streams, pipes and redirection**

First we will make a small file to operate on before doing analyses on our large file. Type

$ head -10 file.fastq > small_file.fastq

You have now created a small file for testing. Count the lines in your original file.fastq and your new small_file.fastq like this

$ wc -l file.fastq
$ wc -l small_file.fastq

Type

$ cat small_file.fastq

What happens? The "cat" command will concatenate files. Here, we only give it one file, so it simply writes that file to our standard output (stdout), which is our terminal.
If we give it two files we can stick them together, end to end. Try

$ cat small_file.fastq small_file.fastq > new.fastq

What happens? (Hint: check how many lines are in new.fastq.)

$ wc -l new.fastq

Vertical concatenation happens with "cat" and horizontal concatenation happens with "paste." For example, we often want to use data in a column format. Type

$ cat small_file.fastq | paste - - - - > column.fastq

Look at your file

$ less column.fastq

What happened? "cat" is writing your small_file.fastq to stdout, and the "|" character pipes the output of one command to another, here from "cat" to "paste." This is really one of the most powerful things about UNIX: the ability both to pipe the output of one command into another and to redirect output to a file. Here, the redirect puts the stdout of paste into a file called column.fastq. Look at the file to see how it is formatted.

We can easily manipulate the columns of this file with the "cut" command. Try

$ cut -f 1 column.fastq

Now try

$ man cut

What is the "-f" doing? What happens if you do 2 or 3? Can you combine them (for example, 1-2, 1-3)? Try piping multiple commands

$ cat column.fastq | cut -f 1 | sort

The Illumina read identifier is a physical location in the machine that tells us:

sequencing machine:run(numerical):run(identifier):lane:tile:x-coordinate:y-coordinate

Is your data from one tile or multiple tiles?

A very useful command is ‘uniq’ which stands for unique. You can use it to collapse repeated lines down to a single instance or to count instances with the -c flag. Because uniq only compares adjacent lines, we usually sort the input first.

One of the big problems in bioinformatics is that different people work on data generation and analysis, and the molecular people know little about what the bioinformatics people are doing and vice versa. For example, a common molecular protocol is to add a short tag or ‘barcode’ to a molecular sample so you can sequence multiple samples together. We can use these barcodes after sequencing to separate samples.
The barcode is typically the first 5-7 bp of the sequencing read. We can check for barcodes like this:

$ cat small_file.fastq | paste - - - - | cut -f 2 | cut -c1-6 | sort | uniq -c | sort -nr | head -20

Take some time to build this command piece by piece and play around with the flags (options) so you can really understand what’s going on here.

Do you think this sample has barcodes? In order to answer this question we first need to construct an expectation of what our sample would look like without barcodes. How many sequences do you have? We are examining the first six characters for a barcode-like pattern and we have 4 possible bases. A purely neutral probability of any one base in any one of the first six positions is then 1/4, and the probability of a string of any six bases is multiplicative, or

1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4

For example, the probability of AAAAAA is given by the expression above, which works out to 1/4096. Given that expectation, how many of any one string of six characters do you expect in your file? You can calculate like this

number of DNA sequences * probability of any one string of six

This is your neutral expectation for any six-base sequence. Now, use the counting line I gave you above to count the frequency of the most common six-nucleotide combinations at the beginning of your sequences. Try this with the large file.fastq. It may take some time to get through the large file with everyone on the head node.

In-class credit: 1) List the new commands you have learned this session and give a short (2-5 word) description of each.
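The neutral expectation described in this session can be worked through in the shell. What follows is a rough sketch, not part of the original exercise: it builds a tiny made-up file (demo.fastq, with three invented reads) so that it runs anywhere, and the awk call is just one convenient way to do the arithmetic. Substitute file.fastq to apply it to your own data.

```shell
# Build a tiny made-up FASTQ file (3 reads x 4 lines); demo.fastq is a hypothetical name.
printf '@read1\nTGCAGGAAGG\n+\nIIIIIIIIII\n@read2\nTGCAGGTTCC\n+\nIIIIIIIIII\n@read3\nGATCGGAAGA\n+\nIIIIIIIIII\n' > demo.fastq

# Each read occupies 4 lines, so the number of reads is the line count divided by 4.
n_lines=$(wc -l < demo.fastq)
n_reads=$((n_lines / 4))

# Neutral probability of one specific six-base string: (1/4)^6 = 1/4096.
# Expected count of any one six-base string = number of reads * (1/4)^6.
expected=$(awk -v n="$n_reads" 'BEGIN { printf "%.6f", n / (4 ^ 6) }')
echo "reads: $n_reads  expected count per six-base string: $expected"

# Observed counts of the first six bases of each read, for comparison with the expectation.
paste - - - - < demo.fastq | cut -f 2 | cut -c1-6 | sort | uniq -c | sort -nr
```

A six-base string that appears far more often than this expectation is a candidate for a shared tag rather than chance.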