1. Understand and describe the structure and format of RAD-seq data.
2. Assemble RAD-seq data sets using ipyrad to produce files for analyses.
3. Learn about phylogenetic and population genetics applications of RAD-seq data.
4. Learn new bioinformatic skills: bash, Linux, Python, Jupyter.
Please follow along:
https://radcamp.github.io/SanFrancisco2024/
RAD-seq is a method for subsampling a reduced representation of a genome by selecting and sequencing genomic fragments near restriction enzyme recognition sites. The goal is to efficiently obtain orthologous genomic regions/loci across many samples to identify genetic variation, without the need to sequence entire genomes.
The many variants of the RAD-seq method differ in their cost, complexity, and efficiency.
Why not just sequence the entire genome at low coverage?
Low-coverage whole genome sequencing (WGS) has had a resurgence in popularity
as the cost of sequencing has decreased, and new imputation methods have been
developed for genotyping very low coverage data.
However, many organisms have very large genomes, which can make the cost prohibitive.
Also, the analysis of WGS data requires a good reference genome, and imputation of
low coverage data additionally requires a high quality/depth reference panel, which
is a large additional cost.
For biodiversity research on non-model organisms, it is faster, easier, and usually
sufficient to use RAD-seq rather than to develop these additional genetic resources.
Why is missing data a problem in RAD-seq?
Part of the efficiency of RAD-seq is accomplished through multiplexing samples, and
errors in this step can lead to variable coverage across samples. In addition, some
restriction fragments will be present in only some samples and not others. As a
consequence, assembled RAD data sets are often sparse, containing many loci that are
present in only a subset of samples, and fewer loci present in every single sample.
This problem is not unique to RAD-seq, but it has received significant attention, and
it needs to be accommodated in many downstream analyses.
We will discuss methods for this tomorrow in our analysis
workshop.
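To make the idea of a sparse data set concrete, here is a minimal sketch in Python. The sample names, locus names, and both helper functions are hypothetical and are not part of ipyrad; the sketch only illustrates how the fraction of missing data and a "minimum samples per locus" filter can be computed from a presence/absence table.

```python
# Toy presence/absence data: locus name -> set of samples in which it was recovered.
# All sample and locus names here are made up for illustration.
samples = ["A", "B", "C", "D"]
loci = {
    "locus_1": {"A", "B", "C", "D"},  # recovered in every sample
    "locus_2": {"A", "B"},
    "locus_3": {"B", "C", "D"},
    "locus_4": {"A"},                 # recovered in a single sample
}


def missing_fraction(loci, samples):
    """Return the fraction of empty cells in the sample-by-locus matrix."""
    total_cells = len(loci) * len(samples)
    filled_cells = sum(len(present & set(samples)) for present in loci.values())
    return 1 - filled_cells / total_cells


def filter_min_samples(loci, min_samples):
    """Keep only loci recovered in at least `min_samples` samples."""
    return {name: present for name, present in loci.items()
            if len(present) >= min_samples}


print("missing fraction:", missing_fraction(loci, samples))
print("loci in all samples:", sorted(filter_min_samples(loci, len(samples))))
print("loci in >= 2 samples:", sorted(filter_min_samples(loci, 2)))
```

Raising the minimum number of samples per locus reduces missing data but also discards loci, which is the trade-off that downstream analyses have to accommodate.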
Can I detect selection, or perform GWAS using RAD-seq?
Yes, RAD-seq can be used for both of these goals (e.g., Nadeau et al. 2013).
However, the goals of your study are important determinants of whether or not RAD-seq
is more efficient than WGS. Detecting selection or trait associations works much better
when RAD or WGS reads are mapped to a reference genome than with anonymous de novo loci.
How many loci/SNPs do I need? How many will I get?
As a rule of thumb, 1,000,000 raw reads per sample is a good starting point, and 10,000 loci per sample
seems to be a good target for the final assembly. More is usually better.
How many loci you end up with depends on genome size, restriction enzyme (RE) cut-site frequency, size selection,
how much sequencing you do, how many samples you've multiplexed, and a few other things!
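A rough back-of-the-envelope calculation can help when planning. The sketch below assumes a random genome sequence with equal base frequencies, so a recognition site of length k is expected about once every 4^k bp. Real genomes deviate from this (GC content, repeats, methylation), and size selection removes many fragments, so treat the numbers as order-of-magnitude guides only. The function names and the 1 Gb genome size are hypothetical.

```python
def expected_cut_sites(genome_size_bp, site_length):
    """Expected number of recognition sites of a given length,
    assuming random sequence with equal base frequencies."""
    return genome_size_bp / 4 ** site_length


def mean_depth(reads_per_sample, n_loci):
    """Upper bound on mean reads per locus, assuming every read
    maps to a retained locus and reads are spread evenly."""
    return reads_per_sample / n_loci


genome_size = 1_000_000_000  # hypothetical 1 Gb genome

for label, site_length in [("6-bp cutter", 6), ("8-bp cutter", 8)]:
    sites = expected_cut_sites(genome_size, site_length)
    print(f"{label}: about {sites:,.0f} expected cut sites")

# The rule-of-thumb numbers from the text: 1,000,000 reads over 10,000 loci.
print("mean depth per locus:", mean_depth(1_000_000, 10_000))
```

In practice the realized depth is lower than this bound, because some reads are low quality, duplicated, or fall in loci that are later filtered out.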
At what phylogenetic scale can I use RAD-seq?
RAD-seq data generated for distantly related samples are expected to share
fewer orthologous fragments, leading to increased missing data. At what
scale is RAD-seq no longer useful? It depends on many factors: rate of
substitutions and genome size change between samples; how many fragments, i.e.,
which enzymes are used; the type of protocol (shearing vs digestion); and sequencing
coverage.
However, RAD-seq has been successfully applied to phylogenetic
questions spanning >30 Mya (oaks; Eaton et al. 2015) and >60 Mya (Viburnum; Eaton et al. 2017).
The efficiency of RAD vs. [others] depends on the scale and density of your taxon
sampling, and planned analyses.
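The expectation that more divergent samples share fewer fragments can be illustrated with a deliberately simple toy model. It assumes that each base of a recognition site differs between two samples independently with probability equal to their per-site sequence divergence, and that a locus is shared only if every recognition site that defines it is intact in both samples. It ignores site gains, indels, size selection, and coverage, so it is a sketch of the trend rather than a prediction. The function names are hypothetical.

```python
def site_retained(divergence, site_length):
    """Probability that a recognition site of `site_length` bp is unchanged,
    given a per-site divergence between two samples (toy model)."""
    return (1 - divergence) ** site_length


def locus_shared(divergence, site_lengths):
    """Probability that all recognition sites defining a locus are retained.
    Use one site length for single-enzyme protocols, two for double digests."""
    probability = 1.0
    for site_length in site_lengths:
        probability *= site_retained(divergence, site_length)
    return probability


# Compare a single 6-bp site with a double digest using 6-bp and 4-bp sites.
for divergence in (0.01, 0.05, 0.10):
    single = locus_shared(divergence, (6,))
    double = locus_shared(divergence, (6, 4))
    print(f"divergence {divergence:.2f}: "
          f"single-enzyme {single:.3f}, double-digest {double:.3f}")
```

Under this model, protocols that depend on more restriction-site bases lose shared loci faster as divergence increases, which is one reason the choice of enzymes and protocol matters at deeper phylogenetic scales.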