Introduction

The Transcript Discovery plugin discovers transcripts in two stages: RNA-Seq reads are first mapped to a genomic reference, allowing large gaps (for introns), and transcripts are then inferred from the read mappings. Note that the Transcript Discovery tool has also been tested to work well with other alignment tools, including STAR, TopHat2, GSNAP and HISAT2.

Detecting novel transcripts from short-read sequencing data is only possible with low precision and sensitivity. Therefore, these tools focus on improving existing annotations for non-model eukaryotic species, updating an annotation based on RNA-Seq data, and/or generating transcript and gene tracks that serve as a common reference for differential expression analysis using the RNA-Seq Analysis tool.

Best practices

The proposed workflow for using the Transcript Discovery plugin in combination with the existing RNA-Seq tool in CLC Genomics Workbench is:

  1. Run the Large Gap Read Mapping tool using all your RNA-Seq reads and a genomic reference sequence.
  2. Run the Transcript Discovery tool on the resulting read mapping to predict transcripts and genes.
  3. Inspect the results and, if necessary, re-run the transcript discovery with refined settings to produce the desired result.
  4. Use the Predicted gene and Predicted Transcript tracks in the existing RNA-Seq tool in the Workbench.

To run an experiment with multiple replicates and tissues, it is possible to supply several Large Gap Read Mappings at once to the tool. These are then processed as one data set. However, you should note that:

For these reasons, we recommend running all replicates of the same condition together, and running different conditions sequentially.

For example, if you have 4 "leaf" samples and 4 "root" samples from a plant, you should first run the tool on all 4 "leaf" samples, and then provide the output transcript track as input for the next invocation of the tool with the 4 "root" samples. You should later remap the samples separately using RNA-Seq Analysis, and prune away any annotations that have little or no expression in all the conditions. This pruning can be necessary if, for example, the "leaf" data does not support a long transcript, so a short one is predicted, while the long transcript is unambiguously present in the "root" data. Revisiting the "leaf" data once the long transcript is known might show that the long transcript is a good fit there too. The original short transcript might then be pruned away.
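
The pruning step can be illustrated outside the Workbench with a minimal sketch. It assumes that per-condition expression values have been exported into a simple table. The function name prune_transcripts, the transcript names, the expression values and the threshold of 1.0 are all hypothetical and chosen only for illustration; they are not part of the plugin or of RNA-Seq Analysis.

```python
from typing import Dict, List


def prune_transcripts(
    expression: Dict[str, Dict[str, float]],
    min_expression: float = 1.0,  # hypothetical cutoff, not a plugin default
) -> List[str]:
    """Keep transcripts that reach min_expression in at least one condition.

    Transcripts with little or no expression in all conditions are dropped.
    """
    kept = []
    for transcript, per_condition in expression.items():
        if any(value >= min_expression for value in per_condition.values()):
            kept.append(transcript)
    return kept


# Made-up expression values per transcript and condition (illustration only).
expression = {
    "long_transcript": {"leaf": 4.2, "root": 15.0},
    "short_transcript": {"leaf": 0.3, "root": 0.0},
    "root_only_transcript": {"leaf": 0.0, "root": 7.5},
}

kept = prune_transcripts(expression, min_expression=1.0)
print(kept)
```

In this made-up table, the short transcript is dropped because it falls below the cutoff in both conditions, while a transcript expressed in only one condition is kept. A suitable cutoff depends on the data set and should be chosen by inspecting the results.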

Known limitations

The Transcript Discovery tool has the following known limitations: