Session 3: Working with Files
We will be logging onto the HPCC and doing some preliminary file manipulation. You will use your panther credentials (username and password) to connect.
Login at the FIU Panther cluster Web portal from a browser: https://hpcgui.fiu.edu Note: Mozilla Firefox has not been working for me lately and I have to connect via Safari.
Once you are on, go to the top where it says 'Clusters' and select 'Panther Shell Access.' It will open a terminal and ask you for your panther credentials again.
Congratulations! You are officially on the server.
We have a large (10Tb) directory on the FIU HPCC that we wll use for storing and working with our sequencing files. We will also use this directory for sharing data files.
Today are analyzing a file of DNA sequence from a recent Illumina HiSeq run (this is only part of the file). Sequencing files are very large so we compress them to save space. The file should be in our course directory, /home/data/jfierst_classroom.
Navigate to this from your directory
$ cd /home/data/jfierst_classroom
Once you have located the directory you can copy the file into your home directory. Note: we are doing this today with a relatively small file to practice working with files. When you get your sequencing data your files will be LARGE and you will need to keep them in this common space.
Some other standard shell commands are cp (copy) and mv (move). Both commands take the syntax operation - from location 1- to location 2.
$ cp ./file.fastq.gz ~
Remember that ~ the 'tilde' is the location reference for your home directory. Now type
$ pwd
What happens? You are probably still in the classroom directory while your copied file is in your home directory.
Now we are ready to start learning how to work with files. Navigate back to your home directory
$ cd ../
or
$ cd ~
and check the file size with
$ du -h file.fastq.gz
What does "du" stand for? What is the -h? Try
$ man du
and remember to use ‘q’ when you are ready to quit looking at the manual.
First we need to uncompress or unzip the file.
$ gunzip file.fastq.gz
What is “gunzip”? Use
$ man gunzip
to examine the manual. You should see the name of the file change from having a '.fastq.gz' extension to '.fastq'
$ ls
Check the file size again
$ du -h file.fastq
Is it different from before? How much of a difference in size does the compression account for?
Look inside the file with
$ more file.fastq
In order to quit "more," type
$ q
and you should return to the prompt. You can also use less, head, and tail. Play around with these a bit, for example
$ less file.fastq
Hint: look at the bottom of your screen when you try ‘more’ or ‘less’.
'More' and 'less' are both terminal pagers which means we use them to view things on our terminal. The key difference is that with less you can go forwards and backwards while with more you can only go forwards.
Then try
$ head file.fastq
$ head -1 file.fastq
$ head -10 file.fastq
$ tail file.fastq
$ tail -2 file.fastq
This is a fastq file. It has four lines for each DNA sequencing read which are 1) the sequence read identifier; 2) the DNA sequence (or read); 3) "+" and 4) Quality scores for each base. We will go into more depth about .fastq files later but for now we will use this file to learn more shell commands. We typically use '.fastq' - note the 'dot' in the front there- to denote that this is a file type.
Count the number of unique lines in the file with 'wc'
$ wc file.fastq
What is the output? (Note: this is a real file and counting lines will take a little while)
Try
$ man wc
to understand the columns. Now try
$ wc -l file.fastq
where wc -l = "word count by line." How many lines do you have? How many actual DNA sequences do you have?
Again, wc is a very powerful command, so take some time to understand it. Type
$ man wc
and read through the manual pages.
How can you use wc to find the sequence read length (tell how many characters are in each DNA sequence)? Hint: it is easiest if you combine wc with another command. Double hint: there are always many different ways to do things in Unix and there are many correct answers to this question.
For example, my solution would be:
$ head -2 file.fastq | tail -1 | wc -c
What is this doing? Try to go through each command individually and understand the pieces of the operation. For example, start with
$ head -2 file.fastq
then
$ head -2 file.fastq | tail -1
then
$ head -2 file.fastq | tail -1 | wc -c
We are going to explore the standard shell commands cp (copy) and mv (move). Both commands take the syntax operation - from location 1- to location 2. To explore these, make a new directory
$ mkdir ./that_directory
and move your file to it
$ mv file.fastq ./that_directory
Now type
$ ls
What happens?
Navigate to your new directory and try again
$ cd that_directory
$ ls
Did you find file.fastq?
You can also create a copy of your file with a different name
$ cp file.fastq file_copy.fastq
or with the same name but in a different directory
$ cp file.fastq ../
What happens?
Searching within a file
We can search in a few different ways. Today we will learn about 'grep' which stands for globally search a regular expression and print. In general 'grep' is a command that will search for a specific pattern. We call these patterns 'regular expressions' because we can use different characters and syntax to specify what we want to search for. We are not going to go too deeply into grep (for that you will need to take Computational Biology next semester!) but we will go through a few fun features here.
Illumina sequence reads receive a 'chastity' mark that tells us about their quality. It's a little confusing because it tells us if they failed the filter or not so 'N' means no, did not fail filter and 'Y' means yes, did fail filter. Our 'Y' sequences are actually our bad ones and our 'N' sequences are our good ones. This filter mark comes at the end of the read identifier and we can search for it with grep
$ grep ':Y:' file.fastq
Here, grep has searched for this colon-Y-colon pattern in your file.fastq and you should see this pattern colored in red. On ASC grep will by default color these results for you but it will not on other systems. If you are doing this on your own computer you can add the flag --color=always to get that same color
$ grep ':Y:' file.fastq --color=always
You can see the colors very well with a longer string. For example, how many reads have a 10-bp adenine homopolymer? We can explore like this
$ grep 'AAAAAAAAAA' file.fastq
We can also use extended regular expressions (the -E flag below) to find our 10-bp adenine homopolymer like this
$ grep -E '([A]{10})' file.fastq
Learning regular expressions is simply learning how to build up these types of patterns.
We can use also grep to get some information about our file. We can count how many sequence reads did not pass chastity like this
$ grep -c ':Y:' file.fastq
Grep should now be returning a count of chastity-failers. We can also search for lines that do not contain our pattern with -v. For example, every sequence read will get a 'Y' or 'N'. Count your chastity-failers like this
$ grep -c ':Y:' file.fastq
and your chastity-passers like this
$ grep '^@HWI' file.fastq | grep -c -v ':Y:'
This should be equivalent to
$ grep -c ':N:' file.fastq
We can also ask grep to return the pieces of our file below and above our desired pattern. This is very useful in a multi-line file format like our .fastq. For example, let's say we want to find the read information for everyone of our 10-bp adenine homopolymers. We can ask grep with -A which delivers lines after our pattern or -B which delivers lines before our pattern. For our 10-bp adenine homopolymers it looks like this
$ grep -B 1 -E '([A]{10})' file.fastq
And now you should see the read identifer before each 10-bp homopolymer. Grep will put a '--' line inbetween each of your matched blocks of text.
Streams, pipes and redirection
First we will make a small file to operate on before doing analyses on our large file.
Type
$ head -10000 file.fastq > small_file.fastq
You have now created a small file for testing. Count the files in your original file.fastq and your new small_file.fastq like this
$ wc -l file.fastq
$ wc -l small_file.fastq
Type
$ cat small_file.fastq
What happens? The "cat" command will concatenate files. Here, we only give it one file and so it concatenates that file with our standard output (stdout), which is our terminal. If we give it two files we can stick them together, end to end. Try
$ cat small_file.fastq small_file.fastq > new.fastq
What happens (Hint: Check how many lines are in new.fastq with)
$ wc -l new.fastq
Vertical concatenation happens with "cat" and horizontal concatenation happens with "paste." For example, we often want to use data in a column format. Type
$ cat small_file.fastq | paste - - - - > column.fastq
Look at your file
$ less column.fastq
What happened?
"cat" is concatenating your file.fastq to the terminal (stdout), and the "|" character pipes the output of one command to another, here from "cat" to "paste." This is really one of the most powerful things about UNIX, the ability to both pipe the output of one command into another, and simple file redirection. Here, the redirect puts the stdout of paste into a file called column.fastq. Look at the file to see how it is formatted.
We can easily manipulate the columns of this file with the "cut" command. Try
$ cut -f 1 column.fastq
Now try
$ man cut
What is the "-f" doing? What happens if you do 2 or 3? Can you combine them (for example, 1-2, 1-3)?
Try piping multiple commands
$ cat column.fastq | cut -f 1 | sort
The Illumina read identifier is a physical location in the machine that tells us:
sequencing machine:run(numerical):run(identifier):lane:tile:x-coordinate:y-coordinate
Is your data from one tile or multiple tiles?
A very useful command is ‘uniq’ which stands for unique. You can use it to streamline multiples down to a single instance or to count instances with the -c flag.
One of the big problems in bioinformatics is that different people work on data generation and analysis and the molecular people know little about what the bioinformatics people are doing and vice versa. For example, a common molecular protocol is to add a short tag or ‘barcode’ to a molecular sample so you can sequence multiple samples together. We can use these barcodes after sequencing to separate samples. The barcode is typically the first 5-7 bp of the sequencing read. We can check for barcodes like this:
$ cat small_file.fastq | paste - - - - | cut -f 2 | cut -c1-6 | sort | uniq -c | sort -nr | head -20
Take some time to build this command piece by piece an play around with the flags (options) so you can really understand what’s going on here.
Do you think this sample has barcodes? In order to answer this question we first need to construct an expectation of what our sample would look like without barcodes. How many sequences do you have? We are examining the first six characters for a barcode-like pattern and we have 4 bases possible. A purely neutral probability of any one base in any one of the first six positions is then 1/4 and the probability of a string of any six bases is multiplicative or
1/4 * 1/4 * 1/4 * 1/4 * 1/4 * 1/4
For example, the probability of AAAAAA is equivalent to the equation above. Given that expectation, how many of any one string of six characters do you expect in your file? You can calculate like this
number of DNA sequences * probability of any string of six
This is your neutral expectation for any six-base sequence. Now, use the counting line I gave you above to count the frequency of the most common six nucleotide combinations in the begining of your sequences.
Try this with the large file.fastq. It may take some time to get through the large file with everyone on the head node.