HBC

Bioinformatics support team at the Harvard School of Public Health. Focus on research computing, NGS and functional analysis.

Resources

Overview

Galaxy

The workshop sessions rely on Galaxy, an open-source framework for biomedical data analysis developed by the Galaxy Team at Penn State. We will walk you through a few basic workflows, but the Galaxy wiki has dozens of additional tutorials and screencasts that are worth exploring.

Galaxy installation for the course

During the courses and workshops we will be using a CloudMan installation of Galaxy running in Amazon’s AWS environment. You can reach the Galaxy instance at http://23.23.134.25/.

Remember to register an account during your first visit, as this enables personal histories, data sharing and much more.

Public Galaxy installations

After the course you can pick from an extensive list of public Galaxy instances. Note that most public servers have strict limits on how much data you can store (a quota) which can be somewhat limiting when it comes to NGS analysis.

Deploying your own Galaxy instance

While you can analyze smaller data sets on any of the public Galaxy instances, most NGS projects tend to run into quota problems sooner rather than later. For that reason it might make sense to set up your own Galaxy instance. We have summarized the different options for doing so, but do not hesitate to contact us with questions.

Additional training materials

Besides the screencasts and tutorials on the main Galaxy site mentioned above, a host of other groups use Galaxy for training purposes. A few notable highlights include:

As useful as Galaxy can be, at some point you will want to branch out to the command line and write some basic scripts to make your life easier. Take a look at the Unix sections of the Perl and Unix Primer for Biologists, which should familiarize you with the shell and basic file manipulation. Once you are comfortable navigating around your folders, try running some of the tools you used in Galaxy on the command line; the FastQC analysis might be a good starting point.

From there, take a look at some of the Swiss Army knives of NGS analysis:

  • The FastX Toolkit to manipulate reads
  • BioPieces with dozens of little helpers
  • BEDTools to slice and dice BED, BAM or VCF files (merge, intersect, sort, etc.)
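
To get a feel for what an interval tool does when it intersects two files, here is a minimal Python sketch. It is not how BEDTools is implemented; the function name `intersect_intervals` and the sample coordinates are made up for illustration, and the code assumes BED-style 0-based, half-open intervals.

```python
def intersect_intervals(a, b):
    """Return the overlapping segments between two lists of (chrom, start, end).

    Assumes BED-style 0-based, half-open coordinates. This is a naive
    all-against-all comparison: fine for illustration, too slow for
    genome-scale files.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # non-empty overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


# Hypothetical example data, not taken from any real annotation
genes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
peaks = [("chr1", 150, 250), ("chr2", 140, 300), ("chr3", 0, 10)]

result = intersect_intervals(genes, peaks)
print(result)
```

For real data sets you would reach for the dedicated tools listed above, which handle sorting, strand information and large files far more efficiently.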

This maturity in tool development is driven by a rapid convergence on a small set of minimal standards, most of which you encountered during the previous sessions. These standards allow a more modular design of workflows and facilitate data exchange between components:

  • FASTQ, Sequence and quality information
  • SAM/BAM, Alignments
  • VCF, Variants
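
As an illustration of how simple these formats are to work with in a script, the following Python sketch reads FASTQ records and computes the mean base quality per read. It assumes the common layout of four lines per record and Phred+33 quality encoding; the helper names `read_fastq` and `mean_phred` and the two example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record (no wrapped sequences).
    Stops at end of file or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean quality score of a read, assuming Phred+33 encoding by default."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two made-up reads; in practice you would pass an open file instead
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))

for name, sequence, quality in records:
    print(name, sequence, mean_phred(quality))
```

Established libraries cover the edge cases this sketch ignores, so treat it as a way to understand the format rather than as a replacement for them.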

These standards and other existing software frameworks facilitate the development of sequence analysis environments such as the Broad’s Genome Analysis Toolkit, eventually allowing you to mix and match your workflows as needed.

Finally, Software Carpentry: one of the best resources on the web to get you started with programming. Whether you use Python, Ruby, Perl or some other programming language to explore your first basic scripts, do set aside a weekend to work through the Carpentry’s materials. While the best practices listed might seem overwhelming in the beginning, they are kept fairly simple and will be invaluable down the road.

NGS resources

You cannot go wrong by simply following the workflows outlined in large-scale genomic papers coming from any of the big sequencing centers, although this frequently requires digging through the supplementary material and online information. For standard tasks the Galaxy tutorial workflows and the SeqAnswers best practices are also a great starting point. Beyond these, one of the best resources for NGS questions is the SeqAnswers forum and Wiki. Take the time to read through the available information and make use of the search function. The BioStar website can also be worth exploring, but has a lower signal-to-noise ratio.

Remaining current

Staying up to date is an ongoing challenge in research. Rather than provide a list of papers and reviews that will likely be outdated within months, I strongly recommend following Stephen Turner’s approach, which is almost identical to mine. Identify blogs, Twitter accounts and websites that are relevant, subscribe to their RSS feeds and spend twenty minutes each day sifting through them.

Additional training material and workshops

The MSU summer course is terrific and, in addition, makes all of its material available for those who could not attend. Also, most of the material from Birt’s NGS course is accessible. Other centers are making their training material available as well; among these, the course material from the UC Davis Bioinformatics Core really stands out.