This repository documents and makes available to the community the code for the publication 'To join or not to join: handling biological replicates in long-read RNA sequencing data'.
A pre-print will be posted to bioRxiv shortly, once the manuscript has been submitted to a journal for review.
The paper investigates strategies for combining long-read RNA-seq data from multiple biological replicates for transcriptome reconstruction. We investigate two strategies: "Join & Call" (J&C), where reads from all replicates are combined before performing transcriptome reconstruction, and "Call & Join" (C&J), where transcriptome reconstruction is performed on each replicate individually before combining the resulting annotations. We compare IsoQuant, FLAIR, Bambu, and TALON on both PacBio and ONT data, as well as Mandalorion and IsoSeq + SQANTI3 Filter on PacBio data only, using a data set of mouse brain and kidney tissue with 5 biological replicates per tissue.
The data used in this study has been submitted to the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home). Mouse brain and kidney data generated using PacBio sequencing are accessible under accession numbers PRJEB85167 and PRJEB94912, respectively.
TODO: Add ONT and Illumina accession numbers
The code is organized as a Nextflow pipeline that runs a specified transcriptome reconstruction tool (from those mentioned above) with both strategies, on both brain and kidney tissue, and on a specified data type (ONT or PacBio, where compatible).
The scripts used by the pipeline are designed specifically for execution on a SLURM cluster and will not work in other environments out of the box.
There are further options (e.g. running FLAIR without supporting short-read data, or performing partial joins with 2, 3, or 4 samples) that are not used for the analyses in the paper.
The .yaml files for setting up the required conda environments can be found under /src/util/conda_envs.
Instructions for cloning the repositories of tools that require a local copy can be found under /src/util/tool_setup.
Examples of using the SLURM wrapper script nextflow_wrapper.sbatch to launch main_workflow.nf:
- Run FLAIR with supporting short reads on ONT data:
sbatch nextflow_wrapper.sbatch --data ont --algorithm flair --stringent true --use_sr true --sr_config star --result_name ont/flair_ar_sr/run1
- Run IsoQuant on PacBio data:
sbatch nextflow_wrapper.sbatch --data isoseq --algorithm isoquant --result_name isoseq/isoquant/run1
/src/util contains utility scripts, including the aforementioned environment and tool setup.
/src/data_preparation_scripts contains scripts to set up the data, including creating the concatenated fastq files needed for the J&C strategy.
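As a minimal sketch of the J&C input preparation (file names and paths here are hypothetical, not the repository's actual layout), joining replicates amounts to concatenating the per-replicate FASTQ files of a tissue into one file:

```shell
#!/usr/bin/env bash
set -euo pipefail

tissue="brain"

# Toy replicate files standing in for the real per-replicate FASTQs.
for rep in 1 2 3; do
  printf '@read%s\nACGT\n+\nIIII\n' "$rep" | gzip > "${tissue}_rep${rep}.fastq.gz"
done

# gzip streams stay valid when appended, so plain cat is enough to join them.
mkdir -p joined
cat "${tissue}"_rep*.fastq.gz > "joined/${tissue}_joined.fastq.gz"

# The joined file contains one header line per replicate read:
gzip -cd "joined/${tissue}_joined.fastq.gz" | grep -c '^@read'
```

The actual scripts additionally handle the real sample naming and the partial-join options mentioned above.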
/src/nextflow contains the nextflow pipeline, separated into the following subdirectories:
/src/nextflow/modules contains the .nf files defining the modules for the different steps.
/src/nextflow/scripts contains the actual .sh and .sbatch scripts used by the modules for execution on the SLURM cluster.
/src/nextflow/workflows contains a variety of workflows; the primary one is main_workflow.nf. On a SLURM cluster, this workflow is run through the nextflow_wrapper.sbatch script.
/src/plotting contains scripts to create the plots used in the paper from the results generated by the workflows.
/src/resouce_inspection contains the scripts used to obtain runtime and memory usage information from the SLURM jobs.
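For illustration, runtime and peak memory of finished SLURM jobs can be queried with sacct; the sketch below parses a canned sacct-style table so it runs anywhere (the job IDs and field choices are illustrative, not the repository's actual scripts):

```shell
#!/usr/bin/env bash
set -euo pipefail

# On the cluster, a table like the one below would come from e.g.:
#   sacct -j "$job_id" --parsable2 --units=M --format=JobID,Elapsed,MaxRSS
sample='JobID|Elapsed|MaxRSS
123456|01:02:03|
123456.batch|01:02:03|2048M'

# Keep only the job steps that report a MaxRSS value, skipping the header:
printf '%s\n' "$sample" | awk -F'|' 'NR > 1 && $3 != "" { print $1, $2, $3 }'
```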
/reports/empty_report contains the basic directory structure for the reports created by the nextflow workflow along with some helper scripts.
For questions about this code and its reuse or adaptation, please use the GitHub issues or contact me through [email protected].