Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  1. Get accessions
  • Get Expression Atlas dataset accessions corresponding to the provided species (and optionally keywords) (run by default; optional)
  • Get NCBI GEO microarray dataset accessions corresponding to the provided species (and optionally keywords) (run by default; optional)
  2. Download data
  • Download Expression Atlas data (run by default; optional)
  • Download NCBI GEO data (run by default; optional)
  3. ID mapping
  • Map gene IDs to NCBI Entrez Gene IDs (or Ensembl IDs) for standardisation among datasets using g:Profiler (run by default; optional)
  4. Data normalisation
  • Normalise raw RNA-seq data using TPM (which requires downloading the corresponding genome and computing transcript lengths) or CPM
  • Perform quantile normalisation on each dataset separately using scikit-learn
  5. Merge all data
  6. Compute base statistics for each gene, across all platforms and for each platform (RNA-seq and microarray)
  7. Compute stability scoring
  • Get list of candidate genes based on base statistics
  • Run optimised, scalable version of NormFinder
  • Run optimised, scalable version of geNorm (NOT run by default; optional)
  • Compute stability scores for each candidate gene
  8. Aggregate results
  9. Prepare Dash Plotly app for further investigation of gene / sample counts
  10. Make MultiQC report
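
To make the difference between the two RNA-seq normalisation options concrete, here is a minimal sketch of CPM and TPM for a single sample. It uses the standard textbook definitions and is not the pipeline's own code; the function names, gene names, counts and transcript lengths are all made up for illustration.

```python
def cpm(counts):
    """Counts per million: scale each raw count by the sample's library size."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: correct for transcript length, then scale to 1e6."""
    # Reads per kilobase of transcript (this is why transcript lengths are needed).
    rpk = {gene: c / (lengths_bp[gene] / 1000) for gene, c in counts.items()}
    total = sum(rpk.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: v * 1_000_000 / total for gene, v in rpk.items()}


# Hypothetical raw counts and transcript lengths (in base pairs) for one sample.
raw = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(cpm(raw))
print(tpm(raw, lengths))
```

In this toy sample geneA and geneB end up with the same TPM even though geneB has three times the raw count, because geneB's transcript is three times longer. CPM ignores length, which is why it needs no genome download.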

Output files

MultiQC

This report is located at multiqc/multiqc_report.html and can be opened in a browser.

Output files
  • multiqc/
    • MultiQC report file: multiqc_report.html.
    • MultiQC data dir: multiqc_data.
    • Plots created by MultiQC: multiqc_plots.

MultiQC (http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Dash Plotly app

dash_app/: folder containing the Dash Plotly app

To launch the app, you must first create and activate the appropriate conda environment:

conda env create -n nf-core-stableexpression-dash -f <OUTDIR>/dash_app/spec-file.txt
conda activate nf-core-stableexpression-dash

then:

cd dash_app
python app.py

and open your browser at http://localhost:8080

Note

The app tries port 8080 by default. If that port is already in use, it tries 8081, 8082 and so on. Check the logs to see which port it is using.
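
The port fallback described in the note can be pictured with the sketch below. It is only an illustration of the "try 8080, then 8081, ..." behaviour and is not taken from the app; the function name `find_free_port` and the `max_tries` limit are assumptions.

```python
import socket


def find_free_port(start=8080, max_tries=20, host="127.0.0.1"):
    """Return the first port in [start, start + max_tries) that can be bound."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{start + max_tries - 1}")


if __name__ == "__main__":
    print(f"First free port starting from 8080: {find_free_port()}")
```

Running this before launching the app gives a rough idea of which port to expect, although the app's logs remain the authoritative source.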

Expression Atlas

Output files
  • public_data/expression_atlas/accessions/: accessions found when querying Expression Atlas
  • public_data/expression_atlas/datasets/: count datasets (normalised: *.normalised.csv / raw: *.raw.csv) and experimental designs (*.design.csv) downloaded from Expression Atlas.

GEO

Output files
  • public_data/geo/accessions/: accessions found when querying GEO
  • public_data/geo/datasets/: count datasets (normalised: *.normalised.csv / raw: *.raw.csv) and experimental designs (*.design.csv) downloaded from GEO.

IDMapping (g:Profiler)

Output files
  • idmapping/
    • Count datasets whose gene IDs have been mapped: *.renamed.csv.
    • Table associating original gene IDs and mapped gene IDs: *.mapping.csv.
    • Gene metadata (name and description): *.metadata.csv.

Normalisation

Output files
  • normalised/: Newly normalised datasets
    • normalised/deseq2/ for DESeq2
    • normalised/edger/ for edgeR
  • quantile_normalised/: Quantile-normalised datasets
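
As an illustration of what quantile normalisation does to a single sample, the sketch below maps each value to its empirical quantile in [0, 1]. This is a simplified, rank-based stand-in written in plain Python. It does not call scikit-learn, and this document does not state which scikit-learn transformer or target distribution the pipeline uses, so treat the function and the example values as assumptions.

```python
def to_uniform_quantiles(values):
    """Map each value to its empirical quantile in [0, 1]; ties share the average rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# Two hypothetical samples measured on very different scales.
sample_1 = [5.0, 120.0, 33.0, 33.0]
sample_2 = [0.2, 0.9, 0.4, 0.1]

print(to_uniform_quantiles(sample_1))
print(to_uniform_quantiles(sample_2))
```

After the transformation both samples live on the same [0, 1] scale regardless of their original units, which is what makes datasets from different platforms comparable.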

Gene base statistics

Output files
  • merged_datasets/: Merged count datasets (sample-wide)
    • merged_datasets/all/: all datasets together
    • merged_datasets/rnaseq/: only RNA-seq datasets
    • merged_datasets/microarray/: only microarray datasets

Merged counts

The file containing all normalised counts is bundled as a Parquet file with the Dash Plotly app.

Output files
  • dash_app/data/all_counts.parquet: Merged count datasets (sample-wide)

Summary of gene statistics and scores

The gene statistics summary is also bundled with the Dash Plotly app.

Output files
  • dash_app/data/all_genes_summary.csv: file containing all gene statistics and scores, with genes ranked by stability score
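
For a quick look at the summary outside the app, a small sketch like the following reads the first few rows with Python's standard csv module. It assumes the rows are already in rank order, as the description above suggests, and makes no assumption about column names. The helper name `top_ranked_genes` is hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path


def top_ranked_genes(summary_csv, n=10):
    """Return the first n rows of the summary as dicts, keeping the file's own order."""
    with open(summary_csv, newline="") as handle:
        return list(islice(csv.DictReader(handle), n))


if __name__ == "__main__":
    # Path relative to the top-level results directory.
    path = Path("dash_app/data/all_genes_summary.csv")
    if path.exists():
        for row in top_ranked_genes(path, n=5):
            print(row)
```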

Overall experimental design

Output files
  • dash_app/data/whole_design.csv: file containing all experimental design information

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports let you troubleshoot errors in a pipeline run and provide other information such as launch commands, run times and resource usage.
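
As an example of using these reports for troubleshooting, the sketch below lists tasks that did not finish successfully according to the trace file. It assumes the trace is tab-separated and contains Nextflow's default `name`, `status` and `exit` columns; a run with customised trace fields may not have them. The helper name `unsuccessful_tasks` is hypothetical.

```python
import csv
from pathlib import Path


def unsuccessful_tasks(trace_path):
    """Return (name, status, exit) for tasks whose status is FAILED or ABORTED."""
    with open(trace_path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [
            (row.get("name"), row.get("status"), row.get("exit"))
            for row in rows
            if row.get("status") in {"FAILED", "ABORTED"}
        ]


if __name__ == "__main__":
    # Path relative to the top-level results directory.
    trace = Path("pipeline_info/execution_trace.txt")
    if trace.exists():
        for name, status, exit_code in unsuccessful_tasks(trace):
            print(f"{status}\texit={exit_code}\t{name}")
```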