Analyzing Your Data with Galaxy

This tutorial introduces some of Galaxy’s basic features. We’ll go over how to upload data to and download data from Galaxy, how to analyze next-generation sequencing data in Galaxy, and how to visualize that data in a genome browser.

Date: April 8th, 2016
Time: 2:30-5:00pm
Location: Millennium Science Complex W 201A

Prerequisites:

A laptop
An account on the Tadpole instance and/or the Main instance of Galaxy
An up-to-date version of a Java Runtime Environment
A local copy of the Integrative Genomics Viewer (IGV)

Material:

Tutorial Outline

Introduction to Galaxy
- Slides
Illumina Sequencing
- Short Video
Organizing, Downloading, and Uploading Data
- PlasmoDB & GeneDB
- File Transfer Protocol (FTP)
Short Read Quality Control
- FASTA & FASTQ format
Mapping to a Reference Sequence
- Short Read Mappers
Alignment Quality Control
- SAM format
Identifying Variants
- VCF format
Annotating Variants
- GFF & BED format
Visualizing Alignments
- IGV
Conclusions
Additional Resources

Introduction to Galaxy

First we’ll look at slides by Stephen Turner, the head of the bioinformatics core at the University of Virginia and a former PhD student in Marylyn Ritchie’s lab. A link to the slides can be found above.

Illumina Sequencing

Next we will watch a short primer on modern Illumina short read sequencing:

http://www.illumina.com/systems/hiseq_2500_1500/technology.html

Then we will go over the FASTQ format, which is how the data is typically stored when a biologist first works with it. It’s also useful to touch on the FASTA format, since it predates FASTQ and represents sequence information in a different but equally widely used way.
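
To make the difference between the two formats concrete, here is a minimal Python sketch that reads a FASTQ record and rewrites it as FASTA. It assumes the simple layout of four lines per record (header, sequence, "+" separator, quality string); the record in EXAMPLE_FASTQ is made up for illustration, and the function names are our own rather than part of any library.

```python
from io import StringIO

# A made-up single-record FASTQ file, used only for illustration.
EXAMPLE_FASTQ = "@read1\nGATTACA\n+\nIIIIIII\n"


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (a blank line also stops this simple reader)
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line carries no extra information here
        quality = handle.readline().rstrip()
        if not header.startswith("@"):
            raise ValueError("FASTQ header lines must start with '@'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings must have equal length")
        yield header[1:], sequence, quality


def to_fasta(name, sequence):
    """Format one record as FASTA: a '>' header line followed by the sequence."""
    return f">{name}\n{sequence}\n"


if __name__ == "__main__":
    for name, sequence, quality in read_fastq(StringIO(EXAMPLE_FASTQ)):
        print(to_fasta(name, sequence), end="")
```

Note that converting to FASTA discards the quality string, which is the main piece of information FASTQ adds on top of FASTA.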

Organizing, Downloading, and Uploading Data

Next we will learn how to gather our data so we can start performing analyses. It’s important that we know the sources of our data, know what format the tools we plan to use require, can move our data from machine to machine without much hassle, and name our data appropriately.

Questions we’re interested in here include:

Where can I find the data I need in order to perform my analysis?
How can I move that data around?
How can I upload and download that data to and from Galaxy?
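
As one possible answer to the question of moving data around, here is a minimal Python sketch of an FTP upload using the standard library's ftplib module. The host name, user name, and file name shown in the usage note are placeholders, not real values for the Tadpole or Main instances; whether a given Galaxy instance accepts FTP uploads, and at which address, is something to check with that instance.

```python
from ftplib import FTP
from pathlib import Path


def stor_command(local_path):
    """Build the FTP STOR command that stores a file under its own base name."""
    return f"STOR {Path(local_path).name}"


def upload_file(host, user, password, local_path):
    """Upload one local file to an FTP server in binary mode.

    All four arguments are supplied by the caller; nothing here is
    specific to a particular Galaxy instance.
    """
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        with open(local_path, "rb") as handle:
            ftp.storbinary(stor_command(local_path), handle)
```

A call might look like upload_file("ftp.example.org", "me@example.org", "my-password", "sample1_R1.fastq"), where every value is hypothetical. Plain FTP sends passwords unencrypted, so a server that supports a secured variant is preferable when one is available.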

Short Read Quality Control

Next we will check the quality of our reads as they come off the sequencer. To do this we will run the FASTQC tool, which can be used from the command line and is also available in the Galaxy Toolshed. The great thing about this tool is that as long as your reads are in FASTQ format, FASTQC can generate a quality report for you, regardless of the sequencing platform. The reads do not all have to be the same length, nor do they have to follow a particular quality encoding. In fact, FASTQC can tell you both of these things if you do not know them ahead of time. It’s a great way to understand what type of sequencing data you’re looking at in the first place, since different data will have different biases.
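
To give a feel for what "quality encoding" and "read length" mean in practice, here is a small Python sketch that inspects the raw strings of a FASTQ file. It uses a common rule of thumb (quality characters below ";" only occur in Phred+33 data) and is not a description of how FASTQC itself works; the function names are our own.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset of the quality encoding from the lowest character seen.

    Characters with an ASCII code below 59 (';') can only come from a
    Phred+33 encoding. Otherwise we assume an offset of 64, although very
    high-quality Phred+33 data could in principle look the same.
    """
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    return 33 if lowest < 59 else 64


def read_length_range(sequences):
    """Return the (shortest, longest) read length in a collection of reads."""
    lengths = [len(sequence) for sequence in sequences]
    return min(lengths), max(lengths)


def mean_quality(quality, offset):
    """Average Phred score of one read, given its quality string and ASCII offset."""
    return sum(ord(char) - offset for char in quality) / len(quality)
```

For example, the quality string "II" decodes to a mean score of 40 under an offset of 33, since "I" has ASCII code 73.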

Questions we’re interested in here include:

What platform was used to perform my sequencing?
Are there any biases I should be aware of?
What proportion of my reads are of high quality?
Is there any detectable contamination?

I would highly recommend consulting Sequencing QC Fail for any additional questions regarding NGS data QC.

Mapping to a Reference Sequence

Next we will take our reads and align them to a reference genome we downloaded from PlasmoDB. We will use BWA to accomplish this task. In particular, we will use the BWA MEM algorithm, since it was developed to handle relatively long reads (greater than or equal to 100 base pairs in length).
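
For readers who want to see roughly what this step looks like outside of Galaxy, here is a Python sketch that builds and runs the corresponding command lines. It assumes the bwa program is installed and on your PATH; the file names in the usage note are placeholders, and the helper functions are our own.

```python
import subprocess


def bwa_index_command(reference):
    """Command that builds the BWA index for a reference FASTA file."""
    return ["bwa", "index", reference]


def bwa_mem_command(reference, reads, mates=None, threads=1):
    """Command that aligns reads (optionally with a mate file) using BWA MEM."""
    command = ["bwa", "mem", "-t", str(threads), reference, reads]
    if mates is not None:
        command.append(mates)  # second FASTQ file for paired-end data
    return command


def align(reference, reads, output_sam, mates=None, threads=1):
    """Index the reference, then write BWA MEM's SAM output to output_sam."""
    subprocess.run(bwa_index_command(reference), check=True)
    with open(output_sam, "w") as sam:
        # bwa mem writes alignments to standard output, so we redirect it.
        subprocess.run(
            bwa_mem_command(reference, reads, mates, threads),
            stdout=sam,
            check=True,
        )
```

A call such as align("reference.fasta", "sample1_R1.fastq", "sample1.sam", mates="sample1_R2.fastq", threads=4) would produce a SAM file; all of those file names are hypothetical.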