Skip to content

sid-sethi/PSQAN

Repository files navigation

PSQAN - a pipeline to prioritise novel and biologically relevant transcripts from long-read RNA sequencing

Snakemake GPLv3 license Tests

Table of contents

Overview

Introduction

Despite the advances in tools to process long-read RNA-sequencing (lrRNA-seq) data, the downstream analysis of transcriptional data remains challenging due to the detection of thousands of novel transcripts and the lack of tools to prioritise functionally important transcripts. From such a large number of transcripts, it is difficult to distinguish between stable transcripts of potential biological importance, partially processed RNAs and splicing noise. Furthermore, when using lrRNA-seq to identify rare and novel transcripts, the recommendation is to incorporate multiple replicates in the study design and implement transcript-level filters. However, determining optimal expression thresholds for filtering and selecting transcripts which are reproducible across samples remains a significant challenge. Consequently, researchers find it challenging to interpret lrRNA-seq data effectively and generate relevant hypothesis which could be experimentally validated in the laboratory.

PSQAN (Post-transcriptomic Structural Quality Assessment and Normalisation) is a Snakemake workflow designed to help researchers identify high-confidence transcripts associated with candidate genes. PSQAN performs a gene-based analysis on characterised transcripts generated by SQANTI3 and TALON. PSQAN normalises transcript expression per gene and re-groups transcripts into actionable categories to support transcript prioritisation, hence making the results more interpretable. PSQAN generates visualisations to help users determine optimal expression thresholds for detecting both known and novel transcripts of probable biological importance. Furthermore, PSQAN allows users to apply multiple transcript level expression thresholds, both to per sample and across all samples. Lastly, PSQAN generates visualisations and an HTML report, enabling users to explore the known and novel transcripts expressed by a gene, alongside their transcript categories and transcript expression. An example of the report generated by PSQAN for a single gene can be downloaded here.

Description

Input data

PSQAN can be used with the transcript characterisation output of either SQANTI3 or TALON, which are the two most prominently used tools in long-read RNA-seq data analysis. PSQAN takes the output produced by SQANTI3 or TALON as input, along with a list of candidate genes to analyse. For each gene, PSQAN extracts the isoforms associated with the gene from the output generated by SQANTI3/TALON. Since the filtering steps in SQANTI3 and TALON are optional and may be skipped, PSQAN applies its own filtering criteria prior to processing to ensure the removal of potential genomic contamination and rare PCR artifacts. PSQAN removes isoforms with a high percentage of genomic "A"s in their downstream 20 bp window (80% is the default), or if one of its junctions is predicted to be a template switching artifact (tagged as "RTS_stage" by SQANTI3).

Note: Output of TALON does not contain all the transcript-level descriptors required by PSQAN. As a result, certain PSQAN processes are skipped when using TALON output. The processes performed by PSQAN for SQANTI3 and TALON are summarised below:

PSQAN process SQANTI3 TALON
Filtering internal priming artifacts Yes Yes
Filtering template switching artifacts Yes No (missing required data)
Normalising transcript expression Yes Yes
Isoform re-categorisation Yes No (missing required data)
Transcript-level filtering Yes Yes
Visualisations Yes Yes

Normalising transcript expression per gene

PSQAN calculates the normalised full-length reads for each transcript (NFLRTi), which quantifies transcript expression as the percentage of total gene transcription. This normalisation emphasises transcript usage relative to overall gene output, thereby simplifying interpretation. For instance, a transcript with an NFLR value of 10.0 would imply that it accounts for 10% towards all transcripts generated from the gene locus. PSQAN’s normalisation also removes variation due to overall gene expression differences between samples, hence making comparisons of transcript usage independent of absolute gene expression.

Given a transcript T in sample i with FLR as the number of full-length reads mapped to the transcript T, PSQAN calculates the normalised full-length reads (NFLRTi) as:

where, NFLRTi represents the normalised full-length read count of transcript T in sample i, FLRTi is the full-length read count of transcript T in sample i and M is the total number of transcripts identified to be associated with the gene after filtering. Finally, to summarise the expression of a transcript associated with a gene across multiple samples (if any), PSQAN calculates the mean of NFLRTi across all the samples:

where, NFLRT represents the mean expression of transcript T across all samples and N is the total number of samples.

Note: PSQAN's normalisation approach has certain limitations that should be considered. First, because this metric does not account for sequencing depth, it may introduce bias if samples with very low coverage are analysed alongside high-coverage samples. Second, it may not be suitable on its own for differential expression analysis at either the gene or transcript level. Third, transcript usage derived from this metric cannot be compared across different genes, as each gene is normalised independently.

Isoform category re-grouping

If PSQAN is used with the output of SQANTI3, it also performs isoform re-grouping into categories which are easy to interpret and facilitates prioritising potentially relevant transcripts. Using the open reading frame (ORF) prediction, nonsense-mediated decay (NMD) prediction and structural categorisation (based on the comparison with reference annotation) of SQANTI3, PSQAN groups the identified isoforms into the following seven categories:

  • Non-coding novel - if predicted to be non-coding and not a full-splice match with the reference
  • Non-coding known - if predicted to be non-coding and a full-splice match with the reference
  • NMD novel - if predicted to be coding and NMD, and not a full-splice match with the reference
  • NMD known - if predicted to be coding and NMD, and a full-splice match with the reference
  • Coding novel - if predicted to be coding and not NMD, and not a full-splice match with the reference
  • Coding known (complete match) - if predicted to be coding and not NMD, and a full-splice and untranslated region match with the reference
  • Coding known (alternate 3'/5' end) - if predicted to be coding and not NMD, and a full-splice match with the reference but with an alternate 3’ end, 5’ end or both 3’ and 5’ end.

Transcript-level filtering

PSQAN implements filtering strategies which provides flexibile and data-driven refinement:

  1. a minimum threshold on transcript expression per sample (NFLRTi)
  2. a minimum threshold on the mean expression across all samples (NFLRT) - not applicable if data has only one sample
  3. a minimum percentage of samples in which a transcript must meet the minimum per sample expression threshold - not applicable if data has only one sample

Furthermore PSQAN provides a visualisation of the number of detected transcripts as a function of varying expression thresholds, showing the number of transcripts which will be retained at every NFLRT threshold. This plot enables users to visually inspect and determine an appropriate NFLRT threshold for retaining high-confidence transcripts. In datasets with multiple samples, PSQAN generates a NFLRTi curve per sample, allowing researchers to examine the variability in transcript detection across samples. This visualisation supports informed decision-making regarding: (a) the minimum expression value at which a transcript should be expressed within each sample; and (b) the minimum number of samples in which a transcript must meet the minimum expression threshold to be considered reproducible.

Getting Started

Dependencies

  • Conda, Miniconda or Mamba (preferred)
  • The rest of the dependencies (including snakemake) are installed via conda using the environment.yml file

Installation

Clone the directory:

git clone --recursive https://github.com/sid-sethi/PSQAN.git

Create conda environment for the pipeline which will install all the dependencies:

cd PSQAN
conda env create -f environment.yml --prefix psqan_venv

Input

  • SQANTI output (classification.txt) or TALON output (read_annot.tsv)
  • A file containing Gene IDs of genes of interest to analyse

Example of gene id file:

gene_id
GeneID1
GeneID2
GeneID3

Practical Suggestions

  • Please see the note on PSQAN limitations under the normalisation section. We recommend using PSQAN on samples with comparable sequencing depth.
  • PSQAN can first be run using the default filtering thresholds to generate pre-filtering visualisations. After examining these visualisations, users can determine suitable thresholds and rerun PSQAN with the updated parameters.

Usage

Modify config.yml to configure this workflow according to your dataset and parameters.

Parameter Description
workdir [STRING] [REQUIRED] Path of working/project directory. This directory should exist. All PSQAN results would be saved within this directory
abundance [FILE] [REQUIRED] Transcript abundance file. Abundance output file from SQANTI and TALON are accepted by PSQAN. For SQANTI, the file ending with "classification.txt" should be provided, while for TALON, read annotation file ending with "read_annot.tsv" should be used
abundance_type [STRING] File type of transcript abundance input file. Accepted choices: ["SQANTI", "TALON"]; [default = "SQANTI"]
gene_ids [FILE] [REQUIRED] A file containing Gene IDs of genes of interest to run PSQAN. Gene IDs column name should be gene_id. If the file contains more than one column, it should be a tab-seperated file - other columns in the file will be ignored
percentage_A_downstream_TTS [NUMERIC] [0-100] Maximum percent of genomic "A"s allowed in the immediate downstream window of the read/isoform. This helps to filter out internal priming artifacts. If percent of genomic "A"s is high (say > 80), the 3' end site of the isoform is probably not reliable. If using TALON abundance file, this percentage is divided by 100 to convert into a fraction. [default = 80]
multisample [BOOLEAN] Do you have transcript abundance of more than one sample/replicate in abundance file? [default = false]
min_exp_perSample [NUMERIC] [0-100] Minimum value of normalised expression required for a transcript PER sample (or replicate). If $multisample=false$, transcripts with normalised expression $>=$ min_exp_perSample would be retained. If $multisample=true$, this parameter is combined with min_sample_perc and transcripts which pass the min_exp_perSample threshold in at least min_sample_perc % of samples are retained. [default = 0.3]
min_exp_mean [NUMERIC] [0-100] Minimum value of mean normalised expression ACROSS all samples required for a transcript. If $multisample=false$, this parameter is ignored. [default = 0]
min_sample_perc [NUMERIC] [0-100] Minimum % of samples which should pass the min_exp_perSample threshold. For instance, min_sample_perc=30 and min_exp_perSample=10 would mean that transcripts identified with a minimum normalised expression of 10 in at least 30% of the total samples would be retained. If $multisample=false$, this parameter is ignored. [default = 0]

TIP: Run PSQAN with default values and use the NFLR_curve plots to identify optimal values of min_exp_perSample, min_exp_mean and min_sample_perc for your dataset.

To run the workflow, use the following commands from the root directory:

conda activate psqan_env
snakemake -j <num_cores> all

The number of cores control how many genes are analysed in parallel. After the snakemake process runs successfully, you can build a html report

snakemake --report <report_name>.html

You may also use the Makefile to run PSQAN instead of snakemake commands. Note that conda environment is automatically activated in the make commands.

make run_psqan    # runs with -j=10 parameter, i.e. 10 cores
make run_report

Output

An example of the workflow output can be viewed here.

Directory structure

working directory  
|--- config.yml                # copy of the parameters used in the pipeline  
|--- report.html               # if snakemake report is generated at the end of the run
|--- Gene_A/  
     |--- pre-filtering/       # plots generated before performing filtering  
     |--- post-filtering/transcriptsRanked.txt      # plots generated after performing filtering   
     |--- logs/              
     |--- gene_normalised_abundance.txt              
     |--- filtered_transcripts.txt              
     |--- NFLR_curve.png              
     |--- NFLR_curve_values.txt          
|--- Gene_B/  
|--- Gene_C/  

Visualisations

PSQAN generates multiple visualisations to aid in the interpretation of results. In addition to visualising transcripts across varying expression thresholds, PSQAN plots the number of transcripts detected in each isoform category and their normalised expression, both before and after filtering. For datasets with multiple samples, PSQAN also computes variability across the samples as standard deviation, which is displayed as error bars. PSQAN displays all transcripts associated with a gene ranked by their normalised expression and coloured by transcript category, allowing users to easily identify dominant transcripts of a gene. Lastly, PSQAN provides an option to generate a gene-level HTML report, compiling all visualisations to facilitate result interpretation.

Transcript detected vs. expression

Number of transcripts detected as a function of varying expression thresholds. This plot can be used to determine the minimum expression threshold to identify high-confidence transcripts.

Transcript count

Number of transcripts detected grouped by transcript categories.

Transcript expression

Expression of transcripts in each transcript category.

Ranked transcripts

Transcripts ranked according to their normalised expression and coloured based on transcript category.

Other useful commands

You can do a dry run (using -n parameter) to view what would be done by the pipeline before executing the pipeline.

snakemake -n all

To exit a running snakemake pipeline, hit ctrl+c on the terminal. If the pipeline is running in the background, you can send a TERM signal which will stop the scheduling of new jobs and wait for all running jobs to be finished.

killall -TERM snakemake

To deactivate the conda environment:

conda deactivate

Licence

Copyright 2024 Astex Therapeutics Ltd.

This repository is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file (GNU General Public License) for more details.

About

Post-transcriptomic Structural Quality Assessment and Normalisation of long-read RNA sequencing data

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published