Setting up the environment and basic quality control
Before we get going with our first RAD-seq assembly, we need to get set up and oriented to our working environment. We make no assumptions about prior experience with cluster environments, so we scaffold the entire participant workshop experience from first principles. More advanced users hopefully will find value in some of the finer details we present.
Tutorial documentation conventions
Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal, for example by copying and pasting the text into your terminal. All lines in code cells beginning with ## are comments and should not be copied and executed. Elements in code cells surrounded by angle brackets (e.g. <username>
) are variables that need to be replaced by the user. All other lines should be interpreted as output from the issued commands.
## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt
## Print "wat" to the screen
$ echo "wat"
wat
Examine the raw data
For this workshop we will be looking at and working with only the simulated data.
$ ls ipsimdata
gbs_example_barcodes.txt pairddrad_example_genome.fa.smi pairgbs_example_barcodes.txt rad_example_barcodes.txt
gbs_example_genome.fa pairddrad_example_R1_.fastq.gz pairgbs_example_R1_.fastq.gz rad_example_genome.fa
gbs_example_R1_.fastq.gz pairddrad_example_R2_.fastq.gz pairgbs_example_R2_.fastq.gz rad_example_genome.fa.fai
pairddrad_example_barcodes.txt pairddrad_wmerge_example_barcodes.txt pairgbs_wmerge_example_barcodes.txt rad_example_genome.fa.sma
pairddrad_example_genome.fa pairddrad_wmerge_example_genome.fa pairgbs_wmerge_example_genome.fa rad_example_genome.fa.smi
pairddrad_example_genome.fa.fai pairddrad_wmerge_example_R1_.fastq.gz pairgbs_wmerge_example_R1_.fastq.gz rad_example_R1_.fastq.gz
pairddrad_example_genome.fa.sma pairddrad_wmerge_example_R2_.fastq.gz pairgbs_wmerge_example_R2_.fastq.gz
Inspect the data
Then we will use the zcat
command to read lines of data from one of the files
and we will trim this to print only the first 20 lines by piping the output to
the head
command. Using a pipe (|
) like this passes the output from one
command to another and is a common trick in the command line.
Here we have our first look at a fastq formatted file. Each sequenced read is spread over four lines, one of which contains sequence and another the quality scores stored as ASCII characters. The other two lines are used as headers to store information about the read.
$ zcat ipsimdata/rad_example_R1_.fastq.gz | head -n 20
@lane1_locus0_2G_0_0 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_1 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCCACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_2 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_3 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_4 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Here the quality scores are all uniformly high (probability of miscalled base < 0.0005), because it’s simulated. The fastqc wikipedia page actually has a really nice section on quality score interpretation. Real data will look more like this:
@HWI-D00289:107:C3J97ACXX:6:1101:1420:2169 1:N:0:CGTACG
CGAATCATGCATTATTTATTTATGTATTACATTTGTATAGTGCTCTAACCCTCAGGCATCTGAGCGCTTCTTACATGGTTACATTTAGTTACATTGGTTGC
+
BB<BBBFBBB0<F<0<BBBFB<FF<<BBFBBBFBF<BF<BB<7B0B70BFFFFFFBF0BFBFBBBFIBBBBBFFB<<BBB7<<BB<<<BF7<<BBBBBBBB
@HWI-D00289:107:C3J97ACXX:6:1101:1737:2150 1:N:0:CGTACG
ACTTCCATGCTCACAGTGAAATCAGTGCTGGACATCGTGGGTCCAAGAAAACGGCCGAGATATTGATGGAAGTGGCTTACTGGCCGCATCTATATCAAGAT
+
B<BFFF<B0BFBFFFIBBFBIFIIFB0BBFFFFBFFFBFF7BFBBB7BFFB'BBFBFFF<FFFFF<BBBBBBBBBBBBBBBBBBBBBBBB<'<<BBBBB<B
@HWI-D00289:107:C3J97ACXX:6:1101:1806:2113 1:N:0:CGTACG
ACTGGCATGCTGACAATATGCACATGGATGATGACACACTACGATACTGGCATGTGAATGGAATATTGGGTGAGGGGCAACCACAATGCAATACTGTTGTA
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIFFIIIIIIIIFIIIIIIIIIIFFIBFFIIFIIIFFBFIBFFFFFFFBBFFFFBFBBFFFFFFFFFFBBF
@HWI-D00289:107:C3J97ACXX:6:1101:1809:2136 1:N:0:CGTACG
ATTACAAGAAGGGACTCTGTCCCTGATTGGGAGCATATATTGAAGTGATCGTGATCTCCGTAAGCTTAATTCTGCCTTTTACTATAATGTACCTTCCCTTC
+
BBBFFBFFFFFFFF7FFF<BFFIFF0<B7BF'BFBBBBFFBFBFF7BBFFFFF7BBBFFF7FBFFFI'<BFFFFFF<BBF<BB0<BBBBBBBBB7'77<BB
FastQC for quality control
The first step of any RAD-Seq assembly is to inspect your raw data to estimate overall quality. We began first with a visual inspection above, but of course we can only visually inspect a very tiny proportion of the total data. So instead we use automated approaches to check the quality of our data.
At this stage you can then attempt to improve your dataset by identifying and removing samples with failed sequencing. Another key QC procedure involves inspecting average quality scores per base position and trimming read edges, which is where low quality base-calls tend to accumulate. In this figure, the X-axis shows the position on the read in base-pairs and the Y-axis depicts information about Phred quality score per base for all reads, including median (center red line), IQR (yellow box), and 10%-90% (whiskers). As an example, here is a very clean base sequence quality report for a 75bp RAD-Seq library. These reads have generally high quality across their entire length, with only a slight (barely worth mentioning) dip toward the end of the reads:
In contrast, here is a somewhat typical base sequence quality report for R1 of a 300bp paired-end Illumina run of ezrad data:
This figure depicts a common artifact of current Illumina chemistry, whereby quality scores per base drop off precipitously toward the ends of reads, with the effect being magnified for read lengths > 150bp. The purpose of using FastQC to examine reads is to determine whether and how much to trim our reads to reduce sequencing error interfering with basecalling. In the above figure, as in most real dataset, we can see there is a tradeoff between throwing out data to increase overall quality by trimming for shorter length, and retaining data to increase value obtained from sequencing with the result of increasing noise toward the ends of reads.
Running FastQC
In the interest of saving time during this short workshop we will only indicate how fastqc is run, and will then focus on interpretation of typical output. More detailed information about actually running fastqc are available elsewhere on the RADCamp site.
For future reference, you can install fastqc from the bioconda channel:
$ conda install -c bioconda fastqc -y
NB: The -y
here just tells conda “When you ask if I’m sure I want to
install, the answer is YES!”, prevents it from prompting you.
To run fastqc on one of the simulated samples you may execute this command:
$ mkdir fastqc-results
# The -o flag tells fastqc where to write output files.
$ fastqc -o fastqc-results ipsimdata/rad_example_R1_.fastq.gz
Note: Running fastqc on real data can be somewhat slow, plus most of the samples from one library should look mostly the same. It is always good practice to actually look at your data before proceeding with the assembly, but you don’t have to look at all of it. Typically 3-5 randomly selected samples should give a good overall view of quality.
Inspecting and Interpreting FastQC Output
Simulated data is too clean to be interesting, so lets spend a moment looking at the results from some real data from Prates et al (2016) (Anolis punctatus, GBS data).
Lets start with Per base sequence quality, because it’s very easy to interpret, and often times with RAD-Seq data results here will be of special importance.
For the Anolis data the sequence quality per base is uniformly quite high, with dips only in the first and last 5 bases (again this is typical for Illumina reads). Based on information from this plot we can see that the Anolis data doesn’t need a whole lot of trimming, which is good.
Now lets look at the Per base sequece content
, which FastQC highlights with a
scary red X.
The squiggles indicate base composition per base position averaged across the
reads. It looks like the signal FastQC is concerned about here is related to the
extreme base composition bias of the first 5 positions. We happen to know this
is a result of the restriction enzyme overhang present in all reads (TGCAT
in
this case for the EcoT22I enzyme used), and so it is in fact of no concern. Now
lets look at Adapter Content
:
Here we can see adapter contamination increases toward the tail of the reads, approaching 40% of total read content at the very end. The concern here is that if adapters represent some significant fraction of the read pool, then they will be treated as “real” data, and potentially bias downstream analysis. In the Anolis data this looks like it might be a real concern so if we were assembling this dataset we’d want to keep this in mind during step 2 of the ipyrad analysis, and incorporate 3’ read trimming and aggressive adapter filtering.
Other than this, the data look good and we can proceed with the ipyrad analysis.
References
Prates, I., Xue, A. T., Brown, J. L., Alvarado-Serrano, D. F., Rodrigues, M. T., Hickerson, M. J., & Carnaval, A. C. (2016). Inferring responses to climate dynamics from historical demography in neotropical forest lizards. Proceedings of the National Academy of Sciences, 113(29), 7978-7985.