
RADCamp NYC 2023 Part II (Bioinformatics)

Day 1 (AM)

Overview of the morning activities:

Welcome and Introductions

Instructor introduction slides

Learning objectives.

By the end of this workshop you will gain experience with:

Brief intro to RADSeq

Lead: Deren

Intro to CLI and FASTQ

Lead: Isaac

Accessing a command line interface on CodeOcean

Our first goal is to use a command line interface to view RAD-seq data, both to become familiar with the format of the raw data that we will analyze and to learn some basic command line programs. For this, we will connect to CodeOcean as a remote server. For the moment, to stay focused on RAD-seq and the ipyrad assembly process, we will jump straight to the command line; we will hear much more about the unique features of the CodeOcean platform after lunch.

Get everyone on CodeOcean here:

The commands you type in this terminal are not run on your own computer; they are run on a 16-core virtual machine somewhere out in the ether.

Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal on the CodeOcean capsule, for example by copying and pasting the text into your terminal. Elements in code cells surrounded by angle brackets are variables that need to be replaced by the user. All lines in code cells beginning with ## are comments and should not be copied and executed. All other lines should be interpreted as output from the issued commands.

## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt

## Print "wat" to the screen
$ echo "wat"
wat

Here we’ll use bash commands and command line arguments. If you have trouble remembering the different commands, you can find some very useful commands on this cheat sheet. Take a look at the contents of the folder you’re currently in.

$ ls

There are a bunch of folders. To keep things organized, we will create a new directory to use during this workshop: create it with mkdir, then navigate into it with cd.

$ cd /scratch
$ mkdir ipyrad-workshop
$ cd ipyrad-workshop
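The same create-then-enter pattern can be sketched so that it is safe to re-run. This is only an illustration: the temporary directory stands in for /scratch (an assumption, so the sketch runs outside CodeOcean), and pwd is added to confirm the result.

```shell
## Stand-in for /scratch so this sketch runs anywhere (not the real CodeOcean path)
workdir=$(mktemp -d)

## -p creates any missing parent directories and does not fail if the directory already exists
mkdir -p "$workdir/ipyrad-workshop"
cd "$workdir/ipyrad-workshop"

## Print the current working directory to confirm where you are
pwd
```

If you ever lose track of where you are, pwd followed by ls is usually enough to reorient.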

NB: A word about the behavior of different CO directories.

First view of FASTQ data

Goals of this module:

For this exercise we will use one sample from an Amaranthus dataset, which is also 3RAD. We will download these data using the wget command. Make sure that you are in the ipyrad-workshop folder you just created. Since this is paired-end data, you’ll need to grab both the R1 and R2 files.

$ wget https://github.com/radcamp/radcamp.github.io/raw/master/NYC2023/datafiles/Amaranthus_R1_.fastq.gz
$ wget https://github.com/radcamp/radcamp.github.io/raw/master/NYC2023/datafiles/Amaranthus_R2_.fastq.gz
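After a download it can be worth confirming that a file arrived and is not truncated. The sketch below is one possible check and not part of the original exercise: it builds a tiny stand-in file (the name example_R1_.fastq.gz and its contents are made up) so that it runs without network access.

```shell
## Stand-in gzipped FASTQ so the sketch runs without downloading anything
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nEEEE\n' | gzip > "$tmpdir/example_R1_.fastq.gz"

## List the file with a human-readable size
ls -lh "$tmpdir/example_R1_.fastq.gz"

## gzip -t tests the integrity of a compressed file; exit status 0 means it is intact
if gzip -t "$tmpdir/example_R1_.fastq.gz"; then
    gz_check="ok"
else
    gz_check="corrupt"
fi
echo "$gz_check"
```

On the workshop data you would run ls -lh and gzip -t on the two Amaranthus files instead.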

Now we will use the zcat command to read lines of data from the R1 file, and we will print only the first 20 lines by piping the output to the head command. Using a pipe (|) like this passes the output of one command to another and is a common trick on the command line.
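To see piping in isolation, here is a small sketch that does not need any data files. It uses seq to generate numbered lines as made-up input; the variable name n_lines is only for illustration.

```shell
## Generate 100 numbered lines, then keep only the first 3
seq 1 100 | head -n 3

## Pipes can be chained: count how many lines make it through head
n_lines=$(seq 1 100 | head -n 20 | wc -l | tr -d ' ')
echo "$n_lines"
```

Each command only sees the output of the command before it, which is why head can trim a very large file without the whole file ever being printed.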

Here we have our first look at a fastq formatted file. Each sequenced read is spread over four lines, one of which contains sequence and another the quality scores stored as ASCII characters. The other two lines are used as headers to store information about the read.

$ zcat Amaranthus_R1_.fastq.gz | head -n 20
@NB551405:60:H7T2GAFXY:1:11101:24090:2248 1:N:0:TATCGGTC+CAACCGGG
TTAGGCAATCGGTTATGAGGTTTACGAACAGGTTAAAGGAGTTGAAACTATATTTGGTAAAACAGGACAAGTGCAAGGGG
+
AAAAAEEEEE/EEEAE/AEEEEEEEEEEEEEEEE/EEEEEEEEEAEEEEEEA/EEE<E/EEAEE<EEEEEEEEEEEE<AE
@NB551405:60:H7T2GAFXY:1:11101:4371:2248 1:N:0:TATCGGTC+GTACCAAA
AACTCGTCATCGGCTACATGTGCTATTATCATTGCCATTTATTCTCCTTGAAGTGCACAAACCAGATTGTCTTGTGCTTA
+
AAA/AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAE<EEEAEEEAE/EAEAEE/EAEEEEEEEEEEEE
@NB551405:60:H7T2GAFXY:1:11101:6626:2248 1:N:0:AACCTCCT+CAGGTGAA
GGTCTACGTATCGGCCTCCATCCGATTCTGTTGTTGGTACTTTGACTTTCATTGTCACGTTTTAAAACTTTGACCACTAT
+
AAAAAEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE
@NB551405:60:H7T2GAFXY:1:11101:18661:2248 1:N:0:AAGTCGAG+GGCGATAA
GGTCTACGTATCGGGCCTAGATTTCCCTAGTTAACAATGGTGGAATGAAATTGAATTGATTAAGCAGGAGGAAAAGGATG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB551405:60:H7T2GAFXY:1:11101:18275:2248 1:N:0:CATTCGGT+AACACAGG
TCATGGTCAATCGGTTCATGCTAAACACAATTTCAGAAGTAGCTGTTGAAAGAAGATACATAAAATATAATAGAGATACA
+
/AAA//EAEE/EEEEEEEEE/EEEEEE<EEEEAEEAEEEEEAEEAEEAEEEEEEEEEEEEEAEEEEEAEE<AEEEEEEAE

The first line is the name of the read (its location on the flow cell). The second line contains the sequence data. The third line is a separator (+) and carries no additional information here. The fourth line holds the quality scores for the base calls. The FASTQ Wikipedia page has a good figure depicting the logic behind how quality scores are encoded.
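The four-line layout and the quality encoding can both be checked by hand. The sketch below rests on two assumptions: that the data use the Phred+33 encoding (score = ASCII code minus 33), which is standard for recent Illumina output, and a small stand-in file built from the first two records shown above. The helper name phred is made up for illustration, and gzip -dc is used in place of zcat because it behaves the same way on Linux and also works on macOS.

```shell
## Write the first two records shown above to a small gzipped FASTQ (stand-in for the real file)
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.fastq" <<'EOF'
@NB551405:60:H7T2GAFXY:1:11101:24090:2248 1:N:0:TATCGGTC+CAACCGGG
TTAGGCAATCGGTTATGAGGTTTACGAACAGGTTAAAGGAGTTGAAACTATATTTGGTAAAACAGGACAAGTGCAAGGGG
+
AAAAAEEEEE/EEEAE/AEEEEEEEEEEEEEEEE/EEEEEEEEEAEEEEEEA/EEE<E/EEAEE<EEEEEEEEEEEE<AE
@NB551405:60:H7T2GAFXY:1:11101:4371:2248 1:N:0:TATCGGTC+GTACCAAA
AACTCGTCATCGGCTACATGTGCTATTATCATTGCCATTTATTCTCCTTGAAGTGCACAAACCAGATTGTCTTGTGCTTA
+
AAA/AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAE<EEEAEEEAE/EAEAEE/EAEEEEEEEEEEEE
EOF
gzip "$tmpdir/example.fastq"

## Each read takes four lines, so number of reads = number of lines / 4
n_lines=$(gzip -dc "$tmpdir/example.fastq.gz" | wc -l | tr -d ' ')
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

## Decode one quality character, assuming Phred+33
phred() {
    ## printf with a leading quote prints the numeric code of the character
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}
phred E   ## 36
phred A   ## 32
phred /   ## 14
```

Under this encoding, the E characters that dominate the quality lines above correspond to the highest score present in this output, while / marks a low-confidence base call.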

The pair of sequences at the end of each header line (e.g. TATCGGTC+CAACCGGG) are Illumina’s i7 and i5 index sequences. The libraries you created/will be analyzing used the i7 as the participant identifier and the i5 as the PCR duplicate identifier (unique molecular index). So you should see the same i7 across all reads in your fastq file but different i5 sequences across different reads of the fastq file.
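One way to check which index sequences are present is to tally them from the header lines. This is a minimal sketch, not part of the original exercise: it uses the five header lines shown above as stand-in input, and the variable names are made up for illustration.

```shell
## The five header lines shown above, used as stand-in input
headers='@NB551405:60:H7T2GAFXY:1:11101:24090:2248 1:N:0:TATCGGTC+CAACCGGG
@NB551405:60:H7T2GAFXY:1:11101:4371:2248 1:N:0:TATCGGTC+GTACCAAA
@NB551405:60:H7T2GAFXY:1:11101:6626:2248 1:N:0:AACCTCCT+CAGGTGAA
@NB551405:60:H7T2GAFXY:1:11101:18661:2248 1:N:0:AAGTCGAG+GGCGATAA
@NB551405:60:H7T2GAFXY:1:11101:18275:2248 1:N:0:CATTCGGT+AACACAGG'

## The index pair is the last colon-separated field; the part before "+" is the i7
i7_counts=$(printf '%s\n' "$headers" | awk -F: '{print $NF}' | cut -d+ -f1 | sort | uniq -c)
echo "$i7_counts"

## The part after "+" is the i5; count how many distinct i5 sequences appear
i5_unique=$(printf '%s\n' "$headers" | awk -F: '{print $NF}' | cut -d+ -f2 | sort -u | wc -l | tr -d ' ')
echo "distinct i5: $i5_unique"
```

To run the same tally on a real file, replace the printf step with zcat Amaranthus_R1_.fastq.gz | awk 'NR % 4 == 1', which keeps only the header line of each four-line record. Note that the five example reads shown above carry more than one i7 sequence, so a tally like this is a quick way to see what your own file actually contains.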

A few activities to work through on your own (or in small groups)

Coffee break (10:30-10:50)

ipyrad history, philosophy, and workflow

Lead: Deren

ipyrad CLI simulated data assembly

Lead: Isaac

Exercise: ipyrad command line assembly with simulated data

Break for lunch (12:45-1:30)