
TM3’seq

Tagmentation-Mediated 3'-enriched RNA-seq


TM3’seq Data Analysis Pipeline

The workflow is written using Snakemake. Dependencies are installed using Bioconda.

Overview

This workflow was designed to streamline the analysis of TM3’seq data. However, it can also process FASTQ files derived from any other RNA-seq protocol once the samples have been demultiplexed; the details of demultiplexing vary with the library-preparation protocol.

Starting with FASTQ files, the workflow 1) trims raw reads, 2) aligns them to a reference genome, and 3) counts the number of reads mapping to each gene. The output is a gene-counts file that can be imported into standard software for the analysis of RNA-seq data.

Inputs

  • Demultiplexed, per-sample FASTQ files (fastq.gz)
  • A reference genome sequence and gene annotation, used by STAR and featureCounts
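For example, a data directory holding three demultiplexed samples (file names here are hypothetical) would look like:

ls data
# sampleA.fastq.gz  sampleB.fastq.gz  sampleC.fastq.gz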

Outputs

  • A combined gene-counts file ready for import into standard RNA-seq analysis software
  • A MultiQC report summarizing per-sample QC, trimming, and alignment metrics

Workflow

  1. FASTQ summary and QC metrics - Use FastQC to collect basic QC metrics from the raw FASTQ files
  2. Trim reads - Use Trimmomatic to trim adapter and low-quality sequence from the ends of reads
  3. Align reads - Use STAR to align reads to the genome, accounting for known splice junctions
  4. Deduplicate (optional) - Remove duplicates using nudup, which uses unique molecular identifiers (UMIs)
  5. Count - Use featureCounts (part of the Subread package) to quantify the number of reads uniquely mapped to each gene
  6. Summarize - Combine the count files and run MultiQC to generate a summary report
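For orientation, the trim-align-count core of this list (steps 2, 3, and 5) corresponds roughly to the following single-sample commands. The adapter file, trimming settings, and file names are illustrative placeholders, not the workflow's actual parameters, which live in its Snakemake rules and configuration:

# Single-sample sketch only; settings and paths are placeholders.
trimmomatic SE -threads 4 sample.fastq.gz sample.trimmed.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:20

STAR --runThreadN 4 --genomeDir star_index \
    --readFilesIn sample.trimmed.fastq.gz --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample.

featureCounts -T 4 -a annotation.gtf -o sample.counts.txt \
    sample.Aligned.sortedByCoord.out.bam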

Install prerequisites

  1. Install conda

    • If you have Anaconda installed, you already have it.
    • Otherwise, install the Miniconda package.
  2. Enable the Bioconda channel
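    Per the Bioconda documentation, the channels are added in this order (conda-forge last, so it takes the highest priority):

    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge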

Setup environment and run workflow

  1. Clone workflow into working directory

    git clone <repo> <dir>
    cd <dir>
    
  2. Input data

    Place demultiplexed fastq.gz files in a data directory

  3. Edit configuration files as needed

    cp config.defaults.yml myconfig.yml
    nano myconfig.yml
    
    # Only if running on a cluster
    cp cluster_config.yml mycluster_config.yml
    nano mycluster_config.yml
    
  4. Install dependencies into an isolated environment

    conda env create -n <project> --file environment.yml
    
  5. Activate the environment

    source activate <project>   # with conda >= 4.4 you can use: conda activate <project>
    
  6. Execute the workflow (a complete example session follows this list)

    snakemake --configfile "myconfig.yml" --use-conda 
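
Putting the steps above together, a first run might look like the following session. Here <repo>, the environment name, and the data paths are placeholders to replace with your own:

git clone <repo> tm3seq-workflow
cd tm3seq-workflow
mkdir -p data
cp /path/to/demultiplexed/*.fastq.gz data/   # demultiplexed inputs (step 2)
cp config.defaults.yml myconfig.yml          # then edit paths and settings (step 3)
conda env create -n tm3seq --file environment.yml
source activate tm3seq
snakemake --configfile "myconfig.yml" --use-conda --cores 4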
    

Common options

See the Snakemake documentation for a list of all options.
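These flags, all standard Snakemake options, cover the most common situations:

snakemake --configfile "myconfig.yml" -n -p                    # dry run, printing the shell commands
snakemake --configfile "myconfig.yml" --use-conda --cores 8 -k # keep running independent jobs after a failure
snakemake --configfile "myconfig.yml" --rerun-incomplete       # redo jobs left unfinished by an interruption
snakemake --configfile "myconfig.yml" --unlock                 # release a stale lock after a crash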

Examples

Dry run of the workflow with five samples

snakemake --configfile "myconfig.yml" --dryrun
Job counts:
        count   jobs
        1       all
        1       combined_counts
        5       count
        5       fastqc
        1       multiqc
        5       star_align
        1       star_genome_index
        5       trimmomatic
        24

Running workflow on a single computer with 4 threads

snakemake --configfile "myconfig.yml" --use-conda --cores 4

Running workflow using SLURM

snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "mycluster_config.yml" \
    --cluster "sbatch --cpus-per-task={cluster.n} --mem={cluster.memory} --time={cluster.time}" \
    --use-conda \
    --cores 100

Running workflow on Princeton LSI cluster using DRMAA

Note: When using DRMAA you may need to export the DRMAA_LIBRARY_PATH. On the LSI cluster this can be done by running module load slurm.
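On systems without such a module, set the variable to the location of your SLURM DRMAA library (the path below is only an example):

export DRMAA_LIBRARY_PATH=/path/to/slurm-drmaa/lib/libdrmaa.so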

module load slurm
snakemake \
    --configfile "myconfig.yml" \
    --cluster-config "cetus_cluster_config.yml" \
    --drmaa " --cpus-per-task={cluster.n} --mem={cluster.memory} --qos={cluster.qos} --time={cluster.time}" \
    --use-conda \
    --cores 1000 \
    --output-wait 60