ipyrad command line tutorial - Part I

This is the first part of the full tutorial for the command line interface (CLI) for ipyrad. In this tutorial we’ll walk through the entire assembly and analysis process. This is meant as a broad introduction to familiarize users with the general workflow, and some of the parameters and terminology. We will use as an example in this tutorial the Anolis data set from the first part of class. However, you can follow along with one of the other example data sets if you like and although your results will vary the procedure will be identical.

If you are new to RADseq analyses, this tutorial will provide a simple overview of how to execute ipyrad, what the data files look like, how to check that your analysis is working, and what the final output formats will be. We will also cover how to run ipyrad on a cluster and to do so efficiently.

Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal connected to the Habanero cluster, for example by copying and pasting the text into your terminal. Elements in code cells surrounded by angle brackets (e.g. ) are variables that need to be replaced by the user. All lines in code cells beginning with \#\# are comments and should not be copied and executed. All other lines should be interpreted as output from the issued commands.

## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt

## Print "wat" to the screen
$ echo "wat"
wat

Fetch and unpack the simulated data

First we will practice good data management habits by creating a new directory for the workshop materials, switching to this directory, and fetching and unpacking the simulated data

# Practice good data management habits. Make a directory for the tutorial.
(ipyrad)$ cd ~
(ipyrad)$ mkdir ipyrad-assembly
(ipyrad)$ cd ipyrad-assembly

# Fetch ipyrad github repo (for the simulated data)
(ipyrad)$ git clone https://github.com/dereneaton/ipyrad

# Extract simulated data
(ipyrad)$ tar -xvzf ipyrad/tests/ipsimdata.tar.gz

Overview of Assembly Steps

Very roughly speaking, ipyrad exists to transform raw data coming off the sequencing instrument into output files that you can use for downstream analysis.

png

The basic steps of this process are as follows:

Step 1 - Demultiplex/Load Raw Data
Step 2 - Trim and Quality Control
Step 3 - Cluster or reference-map within Samples
Step 4 - Calculate Error Rate and Heterozygosity
Step 5 - Call consensus sequences/alleles
Step 6 - Cluster across Samples
Step 7 - Apply filters and write output formats

Note on files in the project directory: Assembling RAD-seq type sequence data requires a lot of different steps, and these steps generate a lot of intermediary files. ipyrad organizes these files into directories, and it prepends the name of your assembly to each directory with data that belongs to it. One result of this is that you can have multiple assemblies of the same raw data with different parameter settings and you don’t have to manage all the files yourself! (See Branching assemblies for more info). Another result is that you should not rename or move any of the directories inside your project directory, unless you know what you’re doing or you don’t mind if your assembly breaks.

Getting started

One benefit of the virtual machine we’re using for this workshop is that it conceals some of the complexity of working in a real production environment, such as with an HPC system at your home campus. In this case we provide extensive documentation about using ipyrad on HPC systems elsewhere on the RADCamp site.

ipyrad help

To better understand how to use ipyrad, let’s take a look at the help argument. We will use some of the ipyrad arguments in this tutorial (for example: -n, -p, -s, -c, -r). The complete list of optional arguments and their explanation can be accessed with the --help flag:

$ ipyrad --help
usage: ipyrad [-h] [-v] [-r] [-f] [-q] [-d] [-n NEW] [-p PARAMS] [-s STEPS]
              [-b [BRANCH [BRANCH ...]]] [-m [MERGE [MERGE ...]]] [-c cores]
              [-t threading] [--MPI] [--ipcluster [IPCLUSTER]]
              [--download [DOWNLOAD [DOWNLOAD ...]]]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -r, --results         show results summary for Assembly in params.txt and
                        exit
  -f, --force           force overwrite of existing data
  -q, --quiet           do not print to stderror or stdout.
  -d, --debug           print lots more info to ipyrad_log.txt.
  -n NEW                create new file 'params-{new}.txt' in current
                        directory
  -p PARAMS             path to params file for Assembly:
                        params-{assembly_name}.txt
  -s STEPS              Set of assembly steps to run, e.g., -s 123
  -b [BRANCH [BRANCH ...]]
                        create new branch of Assembly as params-{branch}.txt,
                        and can be used to drop samples from Assembly.
  -m [MERGE [MERGE ...]]
                        merge multiple Assemblies into one joint Assembly, and
                        can be used to merge Samples into one Sample.
  -c cores              number of CPU cores to use (Default=0=All)
  -t threading          tune threading of multi-threaded binaries (Default=2)
  --MPI                 connect to parallel CPUs across multiple nodes
  --ipcluster [IPCLUSTER]
                        connect to running ipcluster, enter profile name or
                        profile='default'
  --download [DOWNLOAD [DOWNLOAD ...]]
                        download fastq files by accession (e.g., SRP or SRR)

  * Example command-line usage: 
    ipyrad -n data                       ## create new file called params-data.txt 
    ipyrad -p params-data.txt -s 123     ## run only steps 1-3 of assembly.
    ipyrad -p params-data.txt -s 3 -f    ## run step 3, overwrite existing data.

  * HPC parallelization across 32 cores
    ipyrad -p params-data.txt -s 3 -c 32 --MPI

  * Print results summary 
    ipyrad -p params-data.txt -r 

  * Branch/Merging Assemblies
    ipyrad -p params-data.txt -b newdata  
    ipyrad -m newdata params-1.txt params-2.txt [params-3.txt, ...]


  * Subsample taxa during branching
    ipyrad -p params-data.txt -b newdata taxaKeepList.txt

  * Download sequence data from SRA into directory 'sra-fastqs/' 
    ipyrad --download SRP021469 sra-fastqs/

  * Documentation: http://ipyrad.readthedocs.io

Create a new parameters file

ipyrad uses a text file to hold all the parameters for a given assembly. Start by creating a new parameters file with the -n flag. This flag requires you to pass in a name for your assembly. In the example we use simdata but the name can be anything at all. Once you start analysing your own data you might call your parameters file something more informative, like the name of your organism and some details on the settings.

# Make sure you're in your ipyrad-assembly directory
$ pwd

# create a new params file named 'simdata'
$ ipyrad -n simdata

Note: The pwd command (prints working directory).

This will create a file in the current directory called params-simdata.txt. The params file lists on each line one parameter followed by a ## mark, then the name of the parameter, and then a short description of its purpose. Lets take a look at it.

$ cat params-simdata.txt
------- ipyrad params file (v.0.9.26)-------------------------------------------
simdata                        ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/radcamp2020/ipyrad-assembly              ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus
8                              ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

In general the defaults are sensible, and we won’t mess with them for now, but there are a few parameters we must change: the path to the raw data, the dataype, the restriction overhang sequence, and the barcodes file.

We will use the nano text editor to modify params-simdata.txt and change these parameters:

$ nano params-simdata.txt

png

Nano is a command line editor, so you’ll need to use only the arrow keys on the keyboard for navigating around the file. Nano accepts a few special keyboard commands for doing things other than modifying text, and it lists these on the bottom of the frame.

We need to specify where the raw data files are located, the type of data we are using (.e.g., ‘gbs’, ‘rad’, ‘ddrad’, ‘pairddrad), and which enzyme cut site overhangs are expected to be present on the reads. Below are the parameter setings you’ll need to change for the simulated single-end RAD example data:

./ipsimdata/rad_example_R1_.fastq.gz        ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
./ipsimdata/rad_example_barcodes.txt        ## [3] [barcodes_path]: Location of barcodes file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)

After you change these parameters you may save and exit nano by typing CTRL+o (to write Output), and then CTRL+x (to eXit the program).

Note: The CTRL+x notation indicates that you should hold down the control key (which is often styled ‘ctrl’ on the keyboard) and then push ‘x’.

Once we start running the analysis ipyrad will create several new directories to hold the output of each step for this assembly. By default the new directories are created in the project_dir directory and use the prefix specified by the assembly_name parameter. Because we use the default (/home/radcamp2020/ipyrad-assembly) for the project_dir for this tutorial, all these intermediate directories will be of the form: ~/ipyrad-assembly/simdata_*, or the analagous name that you used for your assembly name.

Input data format

Before we get started let’s take a look at what the raw data looks like.

Your input data will be in fastQ format, usually ending in .fq, .fastq, .fq.gz, or .fastq.gz. The file/s may be compressed with gzip so that they have a .gz ending, but they do not need to be. Lets take a look at first three reads of one of the simulated data.

$ zcat ipsimdata/rad_example_R1_.fastq.gz | head -n 12

@lane1_locus0_2G_0_0 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_1 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCCACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_2G_0_2 1:N:0:
CTCCAATCCTGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Exercise for the reader: Can you find and verify the overhang sequence in the simulated data? Hint: It’s not right at the beginning of the sequence, which is where you might expect it to be…. It’s always a good idea to look at your data to check for the cut site. Your first sign of a messy dataset is lots of off target reads, basically stuff that got sequenced that isn’t associated with a restriction enzyme cutsite.

Step 1: Demultiplexing the raw data

Commonly, sequencing facilities will give you one giant .gz file that contains all the reads from all the samples all mixed up together. Step 1 is all about sorting out which reads belong to which samples, so this is where the barcodes file comes in handy. The barcodes file is a simple text file mapping sample names to barcode sequences. Lets look at the simulated barcodes:

$ cat ./ipsimdata/rad_example_barcodes.txt
1A_0    CATCATCAT
1B_0    CCAGTGATA
1C_0    TGGCCTAGT
1D_0    GGGAAAAAC
2E_0    GTGGATATC
2F_0    AGAGCCGAG
2G_0    CTCCAATCC
2H_0    CTCACTGCA
3I_0    GGCGCATAC
3J_0    CCTTATGTC
3K_0    ACGTGTGTG
3L_0    TTACTAACA

Here the barcodes are all the same length, but ipyrad can also handle variable length barcodes, and in some cases multiplexed barcodes (3RAD and variants). We can also allow for varying amounts of sequencing error in the barcode sequences (parameter 15, max_barcode_mismatch).

Note on step 1: Occasionally sequencing facilities will send back data already demultiplexed to samples. This is totally fine, and is handled natively by ipyrad. In this case you would use the sorted_fastq_path in the params file to indiciate the sample fastq.gz files. ipyrad will then scan the samples and load in the raw data.

Now lets run step 1! For the simulated data this will take <10 seconds.

Special Note: In command line mode please be aware to always specify the number of cores with the -c flag. If you do not specify the number of cores ipyrad assumes you want all of them, which will result in you hogging up all the CPU on whatever computer you’re on. This is fine inside the VM, but if you’re on an HPC this can cause problems!

## -p    the params file we wish to use
## -s    the step to run
## -c    the number of cores to allocate   <-- Important!
$ ipyrad -p params-simdata.txt -s 1 -c 3

 -------------------------------------------------------------
  ipyrad [v.0.9.26]
  Interactive assembly and analysis of RAD-seq data
 ------------------------------------------------------------- 
  Parallel connection | radcamp2020-VirtualBox: 3 cores
  
  Step 1: Demultiplexing fastq data to Samples
  [####################] 100% 0:00:10 | sorting reads          
  [####################] 100% 0:00:02 | writing/compressing    

  Parallel connection closed.

In-depth operations of running an ipyrad step

Any time ipyrad is invoked it performs a few housekeeping operations:

Load the assembly object - Since this is our first time running any steps we need to initialize our assembly.
Start the parallel cluster - ipyrad uses a parallelization library called ipyparallel. Every time we start a step we fire up the parallel clients. This makes your assemblies go smokin’ fast, and also allows to distribute jobs across compute nodes on an HPC.
Do the work - Actually perform the work of the requested step(s) (in this case demux’ing in sample reads).
Save, clean up, and exit - Save the state of the assembly and spin down the ipyparallel cluster.

The results of step 1 are saved in the *_fastqs directory, where * is a wildcard that stands in for the name of your assembly. In our case it is simdata_fastqs. Lets look inside this directory:

$ ls simdata_fastqs/
1A_0_R1_.fastq.gz  1C_0_R1_.fastq.gz  2E_0_R1_.fastq.gz  2G_0_R1_.fastq.gz  3I_0_R1_.fastq.gz  3K_0_R1_.fastq.gz  s1_demultiplex_stats.txt
1B_0_R1_.fastq.gz  1D_0_R1_.fastq.gz  2F_0_R1_.fastq.gz  2H_0_R1_.fastq.gz  3J_0_R1_.fastq.gz  3L_0_R1_.fastq.gz

As a convenience, ipyrad internally tracks the state of all your steps in your current assembly, so at any time you can ask for results by invoking the -r flag. We also use the -p flag to tell it which params file (i.e., which assembly) we want to print stats for.

## -r fetches informative results from currently executed steps  
$ ipyrad -p params-simdata.txt -r

Summary stats of Assembly simdata
------------------------------------------------
      state  reads_raw
1A_0      1      19862
1B_0      1      20043
1C_0      1      20136
1D_0      1      19966
2E_0      1      20017
2F_0      1      19933
2G_0      1      20030
2H_0      1      20199
3I_0      1      19885
3J_0      1      19822
3K_0      1      19965
3L_0      1      20008

Full stats files
------------------------------------------------
step 1: /home/radcamp2020/ipyrad-analysis/simdata_fastqs/s1_demultiplex_stats.txt
step 2: None
step 3: None
step 4: None
step 5: None
step 6: None
step 7: None

If you want to get even more info ipyrad tracks all kinds of wacky stats and saves them to files inside the directories it creates for each step. For instance to see full stats for step 1 (the wackyness of the step 1 stats at this point isn’t very interesting, but we’ll see stats for later steps are more verbose):

$  cat simdata_fastqs/s1_demultiplex_stats.txt

raw_file                               total_reads    cut_found  bar_matched
rad_example_R1_.fastq                       239866       239866       239866

sample_name                            total_reads
1A_0                                         19862
1B_0                                         20043
1C_0                                         20136
1D_0                                         19966
2E_0                                         20017
2F_0                                         19933
2G_0                                         20030
2H_0                                         20199
3I_0                                         19885
3J_0                                         19822
3K_0                                         19965
3L_0                                         20008

sample_name                               true_bar       obs_bar     N_records
1A_0                                     CATCATCAT     CATCATCAT         19862
1B_0                                     CCAGTGATA     CCAGTGATA         20043
1C_0                                     TGGCCTAGT     TGGCCTAGT         20136
1D_0                                     GGGAAAAAC     GGGAAAAAC         19966
2E_0                                     GTGGATATC     GTGGATATC         20017
2F_0                                     AGAGCCGAG     AGAGCCGAG         19933
2G_0                                     CTCCAATCC     CTCCAATCC         20030
2H_0                                     CTCACTGCA     CTCACTGCA         20199
3I_0                                     GGCGCATAC     GGCGCATAC         19885
3J_0                                     CCTTATGTC     CCTTATGTC         19822
3K_0                                     ACGTGTGTG     ACGTGTGTG         19965
3L_0                                     TTACTAACA     TTACTAACA         20008
no_match                                         _            _            0

Another early indicator of trouble is if you have a ton of reads that are no_match. This means maybe your barcodes file is wrong, or maybe your library prep went poorly. Here, with the simulated data we have no unmatched barcodes, because, well, it’s simulated.

Step 2: Filter reads

This step filters reads based on quality scores and maximum number of uncalled bases, and can be used to detect Illumina adapters in your reads, which is sometimes a problem under a couple different library prep scenarios. Since it’s not atypical to have adapter contamination issues and to have a little noise toward the distal end of the reads lets imagine this is true of the simulated data, and we’ll try to account for this by trimming reads to 90bp and using aggressive adapter filtering.

Edit your params file again with nano:

nano params-simdata.txt

and change the following two parameter settings:

2                               ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
0, 90, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)

Note: Saving and quitting from nano: CTRL+o then CTRL+x

$ ipyrad -p params-simdata.txt -s 2 -c 3

  loading Assembly: simdata
  from saved path: ~/ipyrad-assembly/simdata.json

 -------------------------------------------------------------
  ipyrad [v.0.9.26]
  Interactive assembly and analysis of RAD-seq data
 ------------------------------------------------------------- 
  Parallel connection | radcamp2020-VirtualBox: 3 cores
  
  Step 2: Filtering and trimming reads
  [####################] 100% 0:00:08 | processing reads     

  Parallel connection closed.

The filtered files are written to a new directory called simdata_edits. Again, you can look at the results output by this step and also some handy stats tracked for this assembly.

## View the output of step 2
$ cat simdata_edits/s2_rawedit_stats.txt 
      reads_raw  trim_adapter_bp_read1  trim_quality_bp_read1  reads_filtered_by_Ns  reads_filtered_by_minlen  reads_passed_filter
1A_0      19862                    360                      0                     0                         0  19862
1B_0      20043                    362                      0                     0                         0  20043
1C_0      20136                    349                      0                     0                         0  20136
1D_0      19966                    404                      0                     0                         0  19966
2E_0      20017                    394                      0                     0                         0  20017
2F_0      19933                    376                      0                     0                         0  19933
2G_0      20030                    381                      0                     0                         0  20030
2H_0      20199                    386                      0                     0                         1  20198
3I_0      19885                    372                      0                     0                         0  19885
3J_0      19822                    381                      0                     0                         0  19822
3K_0      19965                    382                      0                     0                         0  19965
3L_0      20008                    424                      0                     0                         0  20008

It’s a little boring, the reads are too clean. Here is an example of something like you’d see from real data (this is the Anolis dataset). Notice the reads_passed_filter value. This dataset is decent, as you can see we’re losing < 10% of the reads per sample, mostly due to the minimum length cutoff.

                   reads_raw  trim_adapter_bp_read1  trim_quality_bp_read1  reads_filtered_by_Ns  reads_filtered_by_minlen  reads_passed_filter
punc_IBSPCRIB0361     250000                 108761                 160210                    66                     12415               237519
punc_ICST764          250000                 107320                 178463                    68                     13117               236815
punc_JFT773           250000                 110684                 190803                    46                      9852               240102
punc_MTR05978         250000                 102932                 144773                    54                     12242               237704
punc_MTR17744         250000                 103394                 211363                    55                      9549               240396
punc_MTR21545         250000                 119191                 161709                    63                     21972               227965
punc_MTR34414         250000                 109207                 193401                    54                     16372               233574
punc_MTRX1468         250000                 119746                 134069                    45                     19052               230903
punc_MTRX1478         250000                 116009                 184189                    53                     16549               233398
punc_MUFAL9635        250000                 114492                 182877                    61                     18071               231868

## Get current stats including # raw reads and # reads after filtering.
$ ipyrad -p params-simdata.txt -r

You might also take a closer look at the filtered reads:

$ zcat simdata_edits/1A_0.trimmed_R1_.fastq.gz | head -n 12

@lane1_locus0_1A_0_0 1:N:0:
TGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_1A_0_1 1:N:0:
TGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_locus0_1A_0_2 1:N:0:
TGCAGTTTAACTGTTCAAGTTGGCAAGATCAAGTCGTCCCTAGCCCCCGCGTCCGTTTTTACCTGGTCGCGGTCCCGACCCAGCTGCCCC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Since the adapter content of the simulated data is effectively 0, the net effect of step 2 is that the reads have been trimmed to 90bp. This isn’t necessary here, but it provides a good example since real data typically will need trimming.

Step 3: clustering within-samples

Step 3 de-replicates and then clusters reads within each sample by the set clustering threshold and then writes the clusters to new files in a directory called simdata_clust_0.85. Intuitively we are trying to identify all the reads that map to the same locus within each sample. The clustering threshold specifies the minimum percentage of sequence similarity below which we will consider two reads to have come from different loci.

The true name of this output directory will be dictated by the value you set for the clust_threshold parameter in the params file. This makes it very easy to test different clustering thresholds, and keep the different runs organized (since you will have for example simdata_clust_0.85 and simdata_clust_0.9).

You can see the default value is 0.85, so our default directory is named accordingly. This value dictates the percentage of sequence similarity that reads must have in order to be considered reads at the same locus. You’ll more than likely want to experiment with this value, but 0.85 is a reliable default, balancing over-splitting of loci vs over-lumping. Don’t mess with this until you feel comfortable with the overall workflow, and also until you’ve learned about Branching assemblies (which we may get to later this afternoon).

There have been many papers written comparing how results of assemblies vary depending on the clustering threshold. In general, my advice is to use a value between about .82 and .95. Within this region results typically do not vary too significantly, whereas above .95 you will oversplit loci and recover fewer SNPs.

It’s also possible to incorporate information from a reference genome to improve clustering at this step, if such a resources is available for your organism (or one that is relatively closely related). We will not cover reference based assemblies in this workshop, but you can refer to the ipyrad documentation for more information.

Note on performance: Steps 3 and 6 generally take considerably longer than any of the other steps, due to the resource intensive clustering and alignment phases. These can take on the order of 10-100x as long as the next longest running step. This depends heavily on the number of samples in your dataset, the number of cores on your computer, the length(s) of your reads, and the “messiness” of your data in terms of the number of unique loci present (this can vary from a few thousand to many millions).

Now lets run step 3:

$ ipyrad -p params-simdata.txt -s 3 -c 3

  loading Assembly: simdata
  from saved path: ~/ipyrad-assembly/simdata.json

 -------------------------------------------------------------
  ipyrad [v.0.9.26]
  Interactive assembly and analysis of RAD-seq data
 ------------------------------------------------------------- 
  Parallel connection | radcamp2020-VirtualBox: 3 cores
  
  Step 3: Clustering/Mapping reads within samples
  [####################] 100% 0:00:02 | dereplicating          
  [####################] 100% 0:00:03 | clustering/mapping     
  [####################] 100% 0:00:00 | building clusters      
  [####################] 100% 0:00:00 | chunking clusters      
  [####################] 100% 0:00:48 | aligning clusters      
  [####################] 100% 0:00:00 | concat clusters        
  [####################] 100% 0:00:00 | calc cluster stats     

  Parallel connection closed.

In-depth operations of step 3:

dereplicating - Merge all identical reads
clustering/mapping - Find reads matching by sequence similarity threshold
building clusters - Group similar reads into clusters
chunking clusters - Subsample cluster files to improve performance of alignment step
aligning clusters - Align all clusters
concat clusters - Gather chunked clusters into one full file of aligned clusters
calc cluster stats - Just as it says!

Again we can examine the results. The stats output tells you how many clusters were found (‘clusters_total’), and the number of clusters that pass the mindepth thresholds (‘clusters_hidepth’). We go into more detail about mindepth settings in some of the advanced tutorials but for now all you need to know is that by default step 3 will filter out clusters that only have a handful of reads on the assumption that it will be difficult to accurately call bases at such low depth.

$ ipyrad -p params-simdata.txt -r

Summary stats of Assembly simdata
------------------------------------------------
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth
1A_0      3      19862                19862            1000              1000
1B_0      3      20043                20043            1000              1000
1C_0      3      20136                20136            1000              1000
1D_0      3      19966                19966            1000              1000
2E_0      3      20017                20017            1000              1000
2F_0      3      19933                19933            1000              1000
2G_0      3      20030                20030            1000              1000
2H_0      3      20199                20198            1000              1000
3I_0      3      19885                19885            1000              1000
3J_0      3      19822                19822            1000              1000
3K_0      3      19965                19965            1000              1000
3L_0      3      20008                20008            1000              1000


Full stats files
------------------------------------------------
step 1: ./simdata_fastqs/s1_demultiplex_stats.txt
step 2: ./simdata_edits/s2_rawedit_stats.txt
step 3: ./simdata_clust_0.85/s3_cluster_stats.txt

Again, the final output of step 3 is dereplicated, clustered files for each sample in ./simdata_0.85/. You can get a feel for what this looks like by examining a portion of one of the files.

## Same as above, `zcat` unzips and prints to the screen and 
## `head -n 24` means just show me the first 24 lines. 
$  zcat simdata_clust_0.85/1A_0.clustS.gz | head -n 24

009149cc23d2367f21b67ac0060d9f2f;size=18;*
TGCAGATAAATCAAACTGCAGCTTGATATGGGCTTCGACCCAGTGGTGGTAGCCTCTCTCTCCCAGTATAACCTCGACCCCAAAATCGCA
d498af3d4575b871de6d5a7f239279ea;size=1;+
TGCAGATAAATCAAACTGCAGCTTGATATGGGCTTCGACCCAGTGGTGGTAGCCTCTCTCTCCCAGTATAACCTCGACCCCAAAATCGCT
b71555537ed7f88329fda094cc6cef8a;size=1;+
TGCAGATAAATCAAACTGCAGCTTGATATGGGCTTCGACCCAGTGGTGGTAGCGTCTCTCTCCCAGTATAACCTCGACCCCAAAATCGCA
//
//
00f1daaa8dd241bd72db91aa62b31bb4;size=8;*
TGCAGGGGTTAGGCGTATCTGCCAAAGATTCTTCGATCGTGATGATTCTAGACGACAATACACCTGATGCTTCTCGCATGCATAGCAATG
6780649efadfc8c182cfd2af7071316b;size=8;+
TGCAGGGGTTAGGCGTATCTGCCAAAGATTCTTCGATCGTGATGATTCTAGAGGACAATACACCTGATGCTTCTCGCATGCATAGCAATG
23a7b43b7f5008017574400c460982dc;size=1;+
TGCAGGGGTTAGGCGTATCTTCCAAAGATTCTTCGATCGTGATGATTCTAGACGACAATACACCTGATGCTTCTCGCATGCATAGCAATG
e6830f9099df558397f0fd28bf9568b6;size=1;+
TGCAGGGGTTAGGCGTATCTGCCAAAGATTCTTCGATCGTGATGATTCTAGACGATAATACACCTGATGCTTCTCGCATGCATAGCAATG
//
//
013b4e939c785d94369ea933f7f98f0c;size=18;*
TGCAGATACTTCGCCCGGTTCTCCATACCCCATTCTTTGCTGCTTCTTCTGAGCGCACTCGACCTATGCCTAGTCGCACCTCGCATATTT
a7e612c565f1d70f054864759b58205f;size=1;+
TGCAGATACTTCGCCCGGTTCTCCATACCCCATTCTTTGCTGCTTCTTCTGAGCGCACTCGACCTATGCCTAGTCCCACCTCGCATATTT
//
//

Reads that are sufficiently similar (based on the above sequence similarity threshold) are grouped together in clusters separated by “//”. For the second cluster above this is probably heterozygous with some sequencing error, and the first and third clusters are probably homozygous. Again, the simulated data is too clean to get a real picture of how tricky real data can be. Looking again at the Anolis data:

000e3bb624e3bd7e91b47238b7314dc6;size=4;*
TGCATATCACAAGAGAAGAAAGCCACTAATTAAGGGGAAAAGAAAAGCCTCTGATATAGCTCCGATATATCATGC-
75e462e101383cca3db0c02fca80b37a;size=2;-
-GCATATCACAAGAGAAGAAAGCCACTAATTAAGGGGAAAAGAAAAGCCTCTGATATAGCTCCGATATATCATGCA
//
//
001236a2310c39a3a16d96c4c6c48df1;size=4;*
TGCATCTCTTTGGGCTGTTGCTTGGTGGCACACCATGCTGCTTTCTCCTCACTTTTTCTCTCTTTTCCTGAGACT------------------------------
4644056dca0546a270ba897b018624b4;size=2;-
------------------------------CACCATGCTGCTTTCTCCTCACTTTTTCTCTCTTTTCCTGAGACTGAGCCAGGGACAGCGGCTGAGGAGGATGCA
5412b772ec0429af178caf6040d2af30;size=1;+
TGCATTTCTTTGGGCTGTTGCTTGGTGGCACACCATGCTGCTTTCTCCTCACTTTTTCTCTCTTTTCCTGAGACT------------------------------
//
//

Are there truly two alleles (heterozygote) for each of these loci? Are they homozygous with lots of sequencing errors, or a heterozygote with few reads for one of the alleles?

Thankfully, untangling this mess is what step 4 is all about.