Analyzing Your Data with Galaxy


This tutorial introduces some of Galaxy’s basic features. We’ll go over how to upload data to and download data from Galaxy, how to analyze next-generation sequencing data in Galaxy, and how to visualize that data in a genome browser.

Date: April 8th, 2016
Time: 2:30-5:00pm
Location: Millennium Science Complex W 201A

Prerequisites:

Material:

Tutorial Outline

  1. Introduction to Galaxy
    • Slides
  2. Illumina Sequencing
    • Short Video
  3. Organizing, Downloading, and Uploading Data
    • PlasmoDB & GeneDB
    • File Transfer Protocol (FTP)
  4. Short Read Quality Control
    • FASTA & FASTQ format
  5. Mapping to a Reference Sequence
    • Short Read Mappers
  6. Alignment Quality Control
    • SAM format
  7. Identifying Variants
    • VCF format
  8. Annotating Variants
    • GFF & BED format
  9. Visualizing Alignments
    • IGV
  10. Conclusions
  11. Additional Resources

Introduction to Galaxy

First we’ll look at slides by Stephen Turner, the head of the bioinformatics core at the University of Virginia and a former PhD student in Marylyn Ritchie’s lab. A link to the slides can be found above.

Illumina Sequencing

Next we will watch a short primer on modern Illumina short read sequencing:

http://www.illumina.com/systems/hiseq_2500_1500/technology.html

Then we will go over the FASTQ format, which is how the data is typically stored when a biologist first works with it. It’s also useful to touch on the FASTA format: it predates FASTQ and represents sequence information in a different but equally widely used way (sequence only, without per-base quality scores).
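To make the two formats concrete, here is a minimal sketch in Python. It assumes the common four-lines-per-record FASTQ layout (header starting with `@`, sequence, a `+` separator, and a quality string of the same length as the sequence). The function names and the two example reads are made up for illustration; they are not part of Galaxy or any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def to_fasta(record: FastqRecord) -> str:
    """FASTA keeps the name and sequence and drops the quality string."""
    return f">{record.name}\n{record.sequence}\n"


# Two hypothetical reads, used only to exercise the functions above.
EXAMPLE_FASTQ = "@read1\nACGTN\nIIII#\n+\n".replace("\n+\n", "") and (
    "@read1\nACGTN\n+\nIIII#\n@read2\nGGCC\n+\n!!II\n"
)

if __name__ == "__main__":
    for rec in read_fastq(io.StringIO(EXAMPLE_FASTQ)):
        print(to_fasta(rec), end="")
```

Real FASTQ files are usually large and often compressed, so in practice an established parser or a Galaxy tool is the better choice; the sketch is only meant to show what each line of a record holds.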

Organizing, Downloading, and Uploading Data

Next we will learn how to gather our data to start performing analyses. It’s important that we know the sources of our data, know what format our data needs to be in for the tools we are going to use, can move our data from machine to machine without much hassle, and name our data appropriately.

Questions we’re interested in here include:

  • Where can I find the data I need in order to perform my analysis?
  • How can I move that data around?
  • How can I upload and download that data to and from Galaxy?

Short Read Quality Control

Next we will check the quality of our reads as they come off the sequencer. To do this we will run the FastQC tool, which can be run from the command line and is also available in the Galaxy Tool Shed. The great thing about this tool is that as long as your reads are in FASTQ format, FastQC can generate a quality report for you, regardless of the sequencing platform. The reads do not all have to be the same length, nor do they have to follow a particular quality encoding. In fact, FastQC can tell you both of these things if you do not know them ahead of time. It’s a great way to understand what type of sequencing data you’re looking at in the first place, since different data will have different biases.
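The kinds of summaries described above can be sketched in a few lines of Python. This is not how FastQC is implemented; it is a simplified illustration of decoding quality strings, guessing the quality encoding, and tabulating read lengths. The offset guess is a heuristic (it assumes only the Phred+33 and +64 encodings exist and can be fooled by uniformly high-quality Phred+33 data), and the Q30 mean threshold is an arbitrary choice for the example.

```python
from collections import Counter
from typing import Iterable, List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode one FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Guess the ASCII offset (33 or 64) from the lowest character seen.

    Heuristic only: characters below ';' (ASCII 59) do not occur in the
    older +64 encodings, so seeing one implies Phred+33.
    """
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    return 33 if lowest < 59 else 64


def length_distribution(sequences: Iterable[str]) -> Counter:
    """Count how many reads there are of each length."""
    return Counter(len(sequence) for sequence in sequences)


def fraction_high_quality(
    quality_strings: Iterable[str], offset: int = 33, min_mean: float = 30.0
) -> float:
    """Fraction of reads whose mean Phred score is at least min_mean.

    Assumes at least one read and no empty quality strings.
    """
    strings = list(quality_strings)
    passing = 0
    for quality in strings:
        scores = phred_scores(quality, offset)
        if sum(scores) / len(scores) >= min_mean:
            passing += 1
    return passing / len(strings)


if __name__ == "__main__":
    # Hypothetical quality strings, for illustration only.
    qualities = ["IIIII", "!!!!!", "II5II"]
    offset = guess_offset(qualities)
    print("guessed offset:", offset)
    print("high-quality fraction:", fraction_high_quality(qualities, offset))
```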

Questions we’re interested in here include:

  • What platform was used to perform my sequencing?
  • Are there any biases I should be aware of?
  • What proportion of my reads are of high quality?
  • Is there any detectable contamination?

I would highly recommend consulting Sequencing QC Fail for any additional questions regarding NGS data QC.

Mapping to a Reference Sequence

Next we will take our reads and align them to a reference genome we downloaded from PlasmoDB. We will use BWA to accomplish this task. In particular, we will use the BWA-MEM algorithm, since it was developed to handle relatively long reads (the BWA documentation recommends it for reads of roughly 70 base pairs and longer).
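In the workshop we run BWA through Galaxy, but it can help to see roughly what the equivalent command-line calls look like. The sketch below only builds the commands (`bwa index` followed by `bwa mem`, with `-t` setting the thread count) and prints them; the helper names and the file names `reference.fasta`, `reads_1.fastq`, and `reads_2.fastq` are placeholders, not files from this tutorial. Actually running them requires BWA to be installed.

```python
import shlex
import subprocess
from typing import List, Optional


def bwa_index_command(reference: str) -> List[str]:
    """Command that builds the BWA index for a reference FASTA file."""
    return ["bwa", "index", reference]


def bwa_mem_command(
    reference: str, reads: str, mates: Optional[str] = None, threads: int = 1
) -> List[str]:
    """Command that aligns single-end or paired-end reads with BWA-MEM."""
    command = ["bwa", "mem", "-t", str(threads), reference, reads]
    if mates is not None:
        command.append(mates)  # second FASTQ file for paired-end data
    return command


def run_to_file(command: List[str], output_path: str) -> None:
    """Run a command and write its standard output (the SAM) to a file."""
    with open(output_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder file names; nothing is executed here, only printed.
    print(shlex.join(bwa_index_command("reference.fasta")))
    print(
        shlex.join(
            bwa_mem_command(
                "reference.fasta", "reads_1.fastq", "reads_2.fastq", threads=4
            )
        )
    )
```

BWA-MEM writes SAM to standard output, which is why `run_to_file` redirects the output into a file rather than passing an output flag.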

Questions we are interested in here include:

  • What are my read lengths?
  • What analyses do I want to perform downstream of mapping my reads?
  • What tools are available to map my reads to a reference?

Alignment Quality Control

Next we will evaluate our mapping results and further determine whether we sequenced what we expected to sequence. There are several ways to assess the quality of your data after you’ve aligned your reads. There are standard ways, such as calculating the genome- or transcriptome-wide coverage distribution, and there are also non-standard ways, such as analyzing all of the unmapped reads to understand why they didn’t map and where they possibly belong. All of this information can be obtained from the alignment file, which is stored in the Sequence Alignment/Map (SAM) format.
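As a small illustration of how this information sits in a SAM file, the sketch below counts mapped reads per reference sequence and tallies unmapped reads using the FLAG column (bit 0x4 marks an unmapped read; 0x100 and 0x800 mark secondary and supplementary alignments, which are skipped so each read is counted once). The example records are invented, and a real analysis would normally use SAMtools or a Galaxy tool rather than hand-written parsing.

```python
from collections import Counter
from typing import Iterable, Tuple

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_sam(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Return (mapped reads per reference, number of unmapped reads)."""
    mapped_per_reference: Counter = Counter()
    unmapped = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 is the bitwise FLAG
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count only the primary record of each read
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped_per_reference[fields[2]] += 1  # column 3 is RNAME
    return mapped_per_reference, unmapped


# Invented records, for illustration only.
EXAMPLE_SAM = "\n".join(
    [
        "@HD\tVN:1.6\tSO:unsorted",
        "@SQ\tSN:chr1\tLN:1000",
        "@SQ\tSN:chr2\tLN:1000",
        "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r1\t256\tchr2\t5\t0\t4M\t*\t0\t0\t*\t*",
    ]
)

if __name__ == "__main__":
    mapped, unmapped_count = summarize_sam(EXAMPLE_SAM.splitlines())
    print("mapped per reference:", dict(mapped))
    print("unmapped reads:", unmapped_count)
```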

Questions we are interested in here include:

  • How many of my reads map where I expect them to?
  • How many of my reads map to potential contaminants and what are those contaminants?
  • What does the genome-wide coverage distribution look like?
  • Are there any noticeable biases?

Identifying Variants

Next we will identify variants in our sample. This includes anything from Single Nucleotide Polymorphisms (SNPs) to Structural Variants (SVs). This information is commonly stored in yet another format called the Variant Call Format (VCF). Several tools available in Galaxy can do this, including SAMtools, VarScan, and FreeBayes.
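To show what a VCF record contains, here is a simplified sketch that reads the fixed columns (CHROM, POS, REF, ALT, INFO), filters on the `DP` depth value in INFO, and labels each alternate allele by comparing its length with the reference allele. The labels are a deliberate simplification rather than the classification used by any of the tools named above, records without a `DP` entry are treated as depth 0, and the example records are invented.

```python
from typing import Dict, Iterable, Iterator, Tuple

Variant = Tuple[str, int, str, str, str]  # chrom, pos, ref, alt, kind


def classify(ref: str, alt: str) -> str:
    """Label an allele pair; a coarse scheme for illustration only."""
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"  # symbolic allele or breakend notation
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "indel"


def parse_info(info: str) -> Dict[str, str]:
    """Split the INFO column into a dict; flag-only keys map to ''."""
    fields: Dict[str, str] = {}
    if info == ".":
        return fields
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def read_variants(lines: Iterable[str], min_depth: int = 0) -> Iterator[Variant]:
    """Yield one tuple per alternate allele with INFO DP >= min_depth."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        columns = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = columns[0], int(columns[1]), columns[3], columns[4]
        depth = int(parse_info(columns[7]).get("DP", 0))
        if depth < min_depth:
            continue
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele, or an overlapping deletion
            yield chrom, pos, ref, alt, classify(ref, alt)


# Invented records, for illustration only.
EXAMPLE_VCF = "\n".join(
    [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20",
        "chr1\t200\t.\tAT\tA\t40\tPASS\tDP=12",
        "chr1\t300\t.\tC\tT,CA\t30\tPASS\tDP=5",
        "chr1\t400\t.\tG\t<DEL>\t.\tPASS\tDP=15;SVTYPE=DEL",
    ]
)

if __name__ == "__main__":
    for variant in read_variants(EXAMPLE_VCF.splitlines(), min_depth=10):
        print(variant)
```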

Questions we are interested in here include:

  • What kinds of variants am I interested in?
  • How many reads do I need to support detected variants?
  • What tools are available to detect these different types of variants?

Annotating Variants

Next we will process and annotate the variants we identified in our sample. To do this we need the FASTA reference file that we mapped our reads to, and an annotation file also helps. Annotation or feature files can be stored in various formats, but the two most common are the General Feature Format (GFF) and the Browser Extensible Data (BED) format. An additional format worth knowing about, which combines sequence and annotation information into one file, is EMBL, but I wouldn’t recommend using that one because it isn’t easy to parse.
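One practical difference between the two annotation formats is the coordinate system: GFF uses 1-based, inclusive start and end positions, while BED uses a 0-based start and an exclusive end. The sketch below converts both into the same internal interval and then looks up which features overlap a (1-based) VCF position. The `Feature` type, the function names, and the `gene1` records are hypothetical, and taking the feature name from the GFF `ID` attribute is just one reasonable choice.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def parse_gff_line(line: str) -> Feature:
    """GFF: columns 4 and 5 are 1-based, inclusive start and end."""
    fields = line.rstrip("\n").split("\t")
    name = fields[2]  # fall back to the feature type
    for attribute in fields[8].split(";"):
        key, _, value = attribute.partition("=")
        if key == "ID":
            name = value
    return Feature(fields[0], int(fields[3]) - 1, int(fields[4]), name)


def parse_bed_line(line: str) -> Feature:
    """BED: column 2 is a 0-based start, column 3 an exclusive end."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else "."
    return Feature(fields[0], int(fields[1]), int(fields[2]), name)


def features_at(
    features: Iterable[Feature], chrom: str, vcf_pos: int
) -> List[Feature]:
    """Features overlapping a 1-based position, as used in VCF."""
    index = vcf_pos - 1  # convert to 0-based before comparing
    return [
        feature
        for feature in features
        if feature.chrom == chrom and feature.start <= index < feature.end
    ]


if __name__ == "__main__":
    # The same hypothetical gene written in both formats.
    gff = parse_gff_line("chr1\texample\tgene\t101\t200\t.\t+\t.\tID=gene1")
    bed = parse_bed_line("chr1\t100\t200\tgene1_bed")
    print(gff)
    print(bed)
    print(features_at([gff, bed], "chr1", 101))
```

Mixing up the two conventions shifts every feature by one base, which is an easy mistake to make when combining annotation files from different sources.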

Questions we are interested in here include:

  • What type of variants do we care about?
  • Is there a particular region of the genome we are interested in?
  • What are the effects of each variant?

Visualizing Alignments

Finally we will visualize the variants we found in a genome browser. For this exercise we will use IGV. A link to download IGV can be found above, under prerequisites. First we will have to download our BAM, BAM index, and VCF files from Galaxy. Next we will open IGV and load a genome and annotation file. Finally, we will load the alignment and variant files and see what our alignment looks like, up close and personal.

Questions we are interested in here include:

  • What genome browser is best for the type of data I’m looking at?
  • Should I view this in a remote or a local resource?
  • What data formats does my genome browser support?
  • Does the browser manipulate the data at all by default?

Reproducibility in Galaxy: Histories and Workflows

Once we’ve finished our analysis, we will likely want to reuse the same set of tools in the same order on similar data, share the workflow with collaborators, or include the workflow in a publication. Galaxy gives us several ways to do this.

In Galaxy we can:

  • Publish & save workflows
  • Publish & save histories
  • Publish & save visualizations

Conclusions

  • There is no excuse for computational irreproducibility - all of the resources are there; use them.
  • Don’t be tool-centric. Focus on what the tool does and how it does it; pick tools that fit nicely together into a workflow, that you understand, and that are well documented, as opposed to ones that:
    1. Look nice in a publication
    2. Are used most widely by the community
    3. Work the fastest
  • Just like molecular biology, computational biology is a trial and error process - get used to playing around with different tools to answer your questions.
  • Move away from black box analyses - don’t ignore the hidden complexity of your data; understand it.

Additional Resources

Visit the resources page of the website!


Updated 2016-04-08