A Beginner's Guide to Bioinformatic Pipelines


What is a Bioinformatic Pipeline?

In simple terms, a bioinformatic pipeline is a series of algorithms, run in sequence, that turns a given input into a desired output.

Figure: A bioinformatic pipeline

In the context of genomic research and analysis, a bioinformatic pipeline transforms raw data into processed data that can be used by the biologist who ran the initial experiment. The biologist needs the raw data processed so that they can visualize it, discover insights and explore their new findings.

Before we continue, let’s define some bioinformatic terms:

A ‘tool’ on its own is a piece of software. 
A ‘workflow’ defines a specific set of inputs and outputs and can consist of one or more tools. 

Tools are not typically used standalone; at the very least, a tool is wrapped into a workflow consisting of that one tool. A single tool can have multiple workflows, each using different inputs, outputs and command line parameters.
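To make the distinction concrete, here is a minimal Python sketch of how a tool and its workflows could be modelled. The class names, fields, file names and the version string are all assumptions made for illustration; they do not come from any particular platform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A piece of software, identified by its name and version."""
    name: str
    version: str


@dataclass(frozen=True)
class Workflow:
    """One or more tools wrapped with a specific set of inputs, outputs and parameters."""
    name: str
    tools: tuple[Tool, ...]
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    parameters: tuple[tuple[str, str], ...] = ()


# Placeholder version string, for illustration only.
bwa = Tool(name="BWA", version="x.y.z")

# Two workflows wrapping the same tool with different inputs and parameters.
paired_end = Workflow(
    name="bwa-paired-end",
    tools=(bwa,),
    inputs=("reads_R1.fastq", "reads_R2.fastq", "reference.fa"),
    outputs=("aligned.bam",),
    parameters=(("threads", "8"),),
)
single_end = Workflow(
    name="bwa-single-end",
    tools=(bwa,),
    inputs=("reads.fastq", "reference.fa"),
    outputs=("aligned.bam",),
    parameters=(("threads", "4"),),
)
```

Both workflows share the same `Tool` object but differ in their inputs and parameters, which is the "multiple workflows for a single tool" case described above.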

Oftentimes, the terms ‘workflow’ and ‘pipeline’ are used interchangeably. If you’d like to dive deeper into understanding the differences between them, please refer to this insightful forum thread.

 

Building a Bioinformatic Pipeline

Building a bioinformatic pipeline involves quite a few components. At a high level, these include inputs from private or public sources (such as reference genomes and FASTQ reads), various tools, and the output files we want those tools to produce. The parameters, version and build of each command line tool need to be set before any output is produced.
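One way to picture these components is as a description of a single pipeline step that is checked before anything runs. The sketch below is only an assumption about how such a description might look: the keys, file names and the `missing_settings` helper are hypothetical.

```python
from __future__ import annotations

# Settings every step is expected to define before it can run (an assumed convention).
REQUIRED_KEYS = ("inputs", "tool", "version", "parameters", "outputs")


def missing_settings(step: dict) -> list[str]:
    """Return the required keys that are absent or empty in a step description."""
    return [key for key in REQUIRED_KEYS if not step.get(key)]


alignment_step = {
    "inputs": {
        "reads": ["sample_R1.fastq", "sample_R2.fastq"],
        "reference": "reference_genome.fa",
    },
    "tool": "STAR",
    "version": "",  # deliberately left unset to show the check
    "parameters": {"threads": 8},
    "outputs": ["aligned.bam", "gene_counts.tsv"],
}

print(missing_settings(alignment_step))  # ['version']
```

Here the step is rejected until a tool version is pinned, which mirrors the point that versions and parameters must be set before any output is produced.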

 

Figure: Building a bioinformatic pipeline

Popular Command Line Tools

Let’s get you familiar with some popular command line tools used to build common pipelines:

| Command Line Tool | Description | Input | Output |
|---|---|---|---|
| FastQC | Evaluates typical QC values for your input data, such as read length, adapter contamination and duplicated reads. | FASTQ | FastQC report |
| STAR | Maps paired FASTQ reads to a reference genome using the STAR aligner. | FASTQ | BAM File & Gene Counts |
| DESeq2 | Identifies genes that are differentially expressed between two groups of gene counts. | Gene Counts | Differential Expression File |
| BWA | Maps paired FASTQ reads to a reference genome with the Burrows-Wheeler Aligner, using the bwa-mem option. | FASTQ | BAM File |
| Mutect2 | Calls somatic mutations on aligned data. | BAM File | VCF File |
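The table above can also be read as data: each tool accepts one file type and produces others, and two tools can be chained when an output type of the first matches the input type of the second. The following Python sketch encodes that idea. The dictionary layout, the simplified type labels and the helper names are assumptions for illustration, and a type match alone does not guarantee that a combination makes scientific sense.

```python
from __future__ import annotations

# Input and output file types per tool, transcribed from the table above
# using simplified labels.
TOOL_IO = {
    "FastQC": {"input": "FASTQ", "outputs": {"FastQC report"}},
    "STAR": {"input": "FASTQ", "outputs": {"BAM", "Gene Counts"}},
    "DESeq2": {"input": "Gene Counts", "outputs": {"Differential Expression"}},
    "BWA": {"input": "FASTQ", "outputs": {"BAM"}},
    "Mutect2": {"input": "BAM", "outputs": {"VCF"}},
}


def can_follow(upstream: str, downstream: str) -> bool:
    """True if the upstream tool produces the file type the downstream tool expects."""
    return TOOL_IO[downstream]["input"] in TOOL_IO[upstream]["outputs"]


def tools_accepting(file_type: str) -> list[str]:
    """List the tools (alphabetically) that take the given file type as input."""
    return sorted(name for name, io in TOOL_IO.items() if io["input"] == file_type)


print(can_follow("STAR", "DESeq2"))    # True
print(can_follow("FastQC", "DESeq2"))  # False
print(tools_accepting("FASTQ"))        # ['BWA', 'FastQC', 'STAR']
```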


Let’s use an example!

Say we have generated some FASTQ files and we want to do a differential expression analysis. 

“The goal of differential expression analysis is to perform statistical analysis to try and discover changes in expression levels of defined features (genes, transcripts, exons) between experimental groups with replicated samples.”
- https://biocorecrg.github.io/RNAseq_course_2019/differential_expression.html

Using some of the command line tools above, we can feed the FASTQ files into STAR to generate Gene Counts files. These Gene Counts files can then be passed to DESeq2 to produce a differential expression file.

Like this:

Figure: Differential expression bioinformatic pipeline
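As a rough sketch of the FASTQ to STAR to DESeq2 example, the Python below builds (but does not run) the commands involved. The STAR options shown are standard STAR command line flags, and `ReadsPerGene.out.tab` is the file STAR writes when `--quantMode GeneCounts` is set. Everything else is an assumption: the sample and index paths are placeholders, and because DESeq2 is an R package rather than a command line program, the `run_deseq2.R` script and its arguments are hypothetical stand-ins for whatever R script you would write yourself.

```python
from __future__ import annotations


def star_command(read1: str, read2: str, genome_dir: str, prefix: str,
                 threads: int = 4) -> list[str]:
    """Build a STAR alignment command that also writes per-gene read counts."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--quantMode", "GeneCounts",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]


def star_gene_counts_path(prefix: str) -> str:
    """Path of the gene counts table STAR writes for a given output prefix."""
    return f"{prefix}ReadsPerGene.out.tab"


def deseq2_command(count_files: list[str],
                   script: str = "run_deseq2.R",  # hypothetical user-written R script
                   output: str = "differential_expression.csv") -> list[str]:
    """Build a call to a hypothetical R script that runs DESeq2 on the count files."""
    return ["Rscript", script, "--output", output, *count_files]


# Placeholder sample names and paths.
samples = ["control1", "control2", "treated1", "treated2"]
star_commands = [
    star_command(f"{s}_R1.fastq", f"{s}_R2.fastq", "star_index", f"{s}_")
    for s in samples
]
counts = [star_gene_counts_path(f"{s}_") for s in samples]
de_command = deseq2_command(counts)

for command in [*star_commands, de_command]:
    print(" ".join(command))
```

To actually execute a command you could pass each list to `subprocess.run(command, check=True)`, provided STAR, R and the script are installed and a STAR genome index has been generated.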

While this is just an overview of what pipelines are and what they can do, there are tons of command line tools that can be used to process your data, depending on your end goal and the resulting output files you want to produce. 

Pipeline Frameworks and Libraries

These command line tools don’t magically stitch themselves together. Bioinformaticians use pipeline toolkits to orchestrate them, so that the output of one command line tool becomes the input of the next.
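To give a feel for what such a toolkit does, here is a deliberately tiny orchestrator in Python. It is a sketch under stated assumptions, not how any particular toolkit is implemented: the `run_pipeline` function and the step names are hypothetical, and the steps are plain functions that return strings instead of running real tools. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) to run steps in dependency order.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies, initial):
    """Run each step after the steps it depends on, passing upstream outputs along.

    steps:        step name -> function taking a dict of upstream results
    dependencies: step name -> set of names it depends on
    initial:      results that already exist (the pipeline's raw inputs)
    """
    results = dict(initial)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # raw inputs are already available
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return results


# Stand-in steps: each just describes what it would produce.
steps = {
    "align": lambda up: f"counts({up['fastq']})",
    "differential_expression": lambda up: f"de({up['align']})",
}
dependencies = {
    "align": {"fastq"},
    "differential_expression": {"align"},
}

results = run_pipeline(steps, dependencies, {"fastq": "sample.fastq"})
print(results["differential_expression"])  # de(counts(sample.fastq))
```

Real pipeline toolkits add much more on top of this ordering logic, such as running steps in parallel, resuming after failures and tracking which outputs are up to date.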

For a comprehensive list of pipeline toolkits available, please visit this link.



The BioBox Platform: Pipeline Design Studio

In the BioBox Platform, you can design, configure and run your own bioinformatic pipelines, just like the differential expression pipeline above! 

Pipeline Design Studio does not require pipeline frameworks and libraries to connect workflows. Rather, it allows the seamless integration of one workflow into the next, without any additional coding. To learn more about Pipeline Design Studio, click here.





