A pipeline for isoform-level transcriptome assembly and lncRNA discovery from long-reads [AVAILABLE AND RUNNING, BUT WORK IN PROGRESS]

integrativebioinformatics/longnoncoder

longnoncoder logo

(23/12/25) The pipeline is available and working, but we are still improving the documentation and cleaning up this repository. We are currently fixing a bug in the tx_annotation/subset_gtf module that causes empty files to be generated. This bug does not affect pipeline execution; the affected output is only supplementary information.

Introduction

integrativebioinformatics/longnoncoder is a bioinformatics Nextflow pipeline that provides a comprehensive analysis of raw long-read RNA-seq data, encompassing transcriptome assembly, quantification, and characterization. The pipeline reports a detailed overview of the entire transcriptome, with particular emphasis on lncRNA structure and isoforms across annotated transcripts and novel candidates.

LongNonCoder is compatible with Ensembl reference genomes and annotations from the following organisms: Homo sapiens, Mus musculus, Danio rerio, Anolis carolinensis, Chrysemys picta bellii, Eptatretus burgeri, Gallus gallus, Latimeria chalumnae, Monodelphis domestica, Notechis scutatus, Ornithorhynchus anatinus, Petromyzon marinus, Sphenodon punctatus, and Xenopus tropicalis. In upcoming releases, we plan to extend the workflow to cover more organisms, or even broader taxonomic classes.

The workflow

longnoncoder workflow

We can describe each step of the workflow as follows:

  1. Quality control of reads (NanoComp)
  2. Filtering and trimming (chopper)
  3. Mapping to a genome reference (minimap2 and samtools)
  4. Quality control of mapped reads (NanoComp)
  5. Transcriptome Assembly (Bambu)
  6. Compare novel transcripts to the annotation reference (GffCompare)
  7. Convert novel transcripts GTF file to FASTA (GffRead)
  8. Predict transcripts as protein-coding or non-coding (RNAmining)
  9. Gather all data from previous steps and generate informative and re-usable metadata .csv and GTF files for both novel and annotated transcripts (Metadata handling)
  10. Provide a report and data visualization for the full transcriptome, with emphasis on lncRNAs (Report)
  11. Gather all possible QC information from the previous steps (MultiQC)
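
To make the list above more concrete, here is a minimal sketch of how steps 3, 6 and 7 could be run by hand with the tools the list names. It is not the pipeline's own code: the function names, file names and chosen options (for example `-ax splice` for minimap2) are assumptions for illustration, and the options the pipeline actually uses may differ. By default the script only prints the commands (dry run), so it can be read and tried without the tools installed.

```shell
#!/usr/bin/env bash
# Hand-run sketch of workflow steps 3, 6 and 7 (NOT the pipeline's own code).
# All file names and option choices below are illustrative assumptions.

# With DRY_RUN=1 (default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 3: spliced alignment of long reads, then sort and index the result.
map_reads() {
    local genome="$1" reads="$2" prefix="$3"
    run minimap2 -ax splice -o "$prefix.sam" "$genome" "$reads"
    run samtools sort -o "$prefix.bam" "$prefix.sam"
    run samtools index "$prefix.bam"
}

# Step 6: compare assembled transcripts against the reference annotation.
compare_to_reference() {
    local ref_gtf="$1" novel_gtf="$2" prefix="$3"
    run gffcompare -r "$ref_gtf" -o "$prefix" "$novel_gtf"
}

# Step 7: extract transcript sequences from the genome as FASTA.
gtf_to_fasta() {
    local genome="$1" novel_gtf="$2" out_fa="$3"
    run gffread -w "$out_fa" -g "$genome" "$novel_gtf"
}
```

After sourcing the script, calling for instance `gtf_to_fasta genome.fa novel.gtf novel.fa` prints the gffread command it would run; setting `DRY_RUN=0` executes the commands instead, which requires the tools to be installed.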

Usage

[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. The pipeline is compatible with both Docker and Singularity.

You can run an example test by following these instructions.

Enter the test_data folder:

cd test_data

Download and unzip the reference FASTA and GTF files, and download the fastq.gz files, with the download-ref.sh script.

First make the script executable:

chmod +x download-ref.sh

Then run it:

./download-ref.sh

Edit the samplesheet.csv file so that each sample entry uses the full (absolute) path on YOUR machine. For example, the full path for a sample could be:

/home/user/longnoncoder/test_data/thesample.fastq.gz
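
Wrong sample paths are an easy mistake at this step, so a quick check before launching can save a failed run. The sketch below is a hypothetical helper, not part of the pipeline. It assumes the samplesheet is a plain comma-separated file without quoted commas, and it makes no assumption about column names: it inspects any field ending in .fastq.gz or .fq.gz and reports those that are not absolute paths or do not exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): report FASTQ entries in a
# samplesheet that are not absolute paths or that do not exist on disk.
# Assumes a simple comma-separated file with no quoted commas.
check_samplesheet() {
    local sheet="$1" status=0 line field
    local -a fields
    while IFS= read -r line || [ -n "$line" ]; do
        line="${line%$'\r'}"                      # tolerate Windows line endings
        IFS=',' read -r -a fields <<< "$line"
        for field in "${fields[@]}"; do
            case "$field" in
                *.fastq.gz|*.fq.gz)
                    if [[ "$field" != /* ]]; then
                        echo "not absolute: $field"
                        status=1
                    elif [[ ! -f "$field" ]]; then
                        echo "missing: $field"
                        status=1
                    fi
                    ;;
            esac
        done
    done < "$sheet"
    return "$status"
}
```

After sourcing the script, `check_samplesheet samplesheet.csv` prints one line per problematic entry and returns a non-zero status if any were found. A path without a leading slash is reported as not absolute.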

Return to the repository root and run the test:

cd ..
nextflow run main.nf -profile test,singularity -params-file test_data/testing.yml

[!WARNING] Please provide pipeline parameters via the CLI or through a YAML parameters file passed with the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation.

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Credits

integrativebioinformatics/longnoncoder was originally written by Bárbara Borges and Lucas Freitas.

We thank the following people for their extensive assistance in the development of this pipeline:

João Cavalcante

Gleison Azevedo

Rodrigo Dalmolin

Thaís Gaudencio

Vinícius Maracajá-Coutinho

Institutions involved

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
