# Adding structure prediction modes to nf-core/proteinfold
This section provides guidance on adding new structure prediction modes, implemented via the `--mode` option, to nf-core/proteinfold.
## Contributing
One of the great advantages of an nf-core pipeline is that the community can extend workflows to add new functionalities. In nf-core/proteinfold, this allows adding new protein structure prediction modules as they are released, while still leveraging the existing workflow infrastructure and reporting.
Please consider writing some code to become an nf-core contributor and expand the pipeline! Reach out to a maintainer or contributor for guidance :)

We are all contactable in the #proteinfold_dev nf-core Slack channel. That is the best place for person-to-person discussions about new additions to the pipeline.
## Locating pipeline sections
- `main.nf`: kicks off each `--mode`'s workflow once the databases have been prepared on the deployment infrastructure. Relevant parameters are passed from `params.[mode_name]` (largely populated from the global `nextflow.config` `params`, which inherits the `dbs.config` database locations) through to the `[MODE_NAME]()` workflow. The returned channels contain the relevant `report_input` metrics, the `top_rank_model` (i.e. the best structure from all inference runs), and standard software versioning info.
- `subworkflows`: largely used for smaller, mode-specific set-up workflows, except for the `post_processing` subworkflow, which is detailed below.
- `workflows/[mode_name].nf`: the `--mode`'s workflow handles input channels of the relevant databases, passes them to the local module that does the prediction work (`RUN_[MODE_NAME]()`), and maps the output from the underlying structure prediction to emitted channels ingested by the reporting modules.
- `modules/local/run_[MODE_NAME]`: this is where the bulk of the compute work is done. Each underlying structure prediction module is bundled with its own `Dockerfile` to set up the software in a container, and a `modules/local/run_[MODE_NAME]/main.nf` to execute the container from Nextflow.
  - input:
    - `meta` contains the metadata for this sub-job, including the `id` column from the `samplesheet.csv`, accessed via `{meta.id}`.
    - `path(fasta)` (or a more flexible yaml or json) locates the biomolecular input sequence file, where `fasta.baseName` gives the underlying input file name (not the `id` label).
    - `path(features)` is used to pass through multiple sequence alignment (MSA) data, in line with AlphaFold2's `features.pkl` file.
    - Other `path()`s largely locate the core AlphaFold sequence databases (or module-specific variants thereof).
  - output:
    - Outputs are structured as a bundled `tuple` of two objects: the first is always `meta`, containing the metadata labels, and then a `path()` to the various output data files useful to the end-user. The prediction module is called in a way that returns files to the process's current directory (`.`).
  - `"""script block"""`:
    - program: the script block calls the program from the Nextflow shell with the program's typical `--flags`, in whatever form (`binary` or `script.py`) the program is distributed from its codebase repository.
    - `extract_metrics.py`: accesses the canonical data output formats from the structure prediction program and returns a core set of plain-text `.tsv` metric files.
- `bin/extract_metrics.py`: a globally accessible program to go from serialised data to `.tsv` plaintext. It currently applies format-specific extraction logic for `.pkl`, `.json` and `.npz` files. However, as the community adds more `--mode`s to the pipeline, different programs could use the same compressed output format, in which case `extract_metrics.py` should be refactored to dispatch on the `--mode` passed to it (see the sketch after this list).
- `subworkflows/local/post_processing.nf`: the `POST_PROCESSING` subworkflow sits after all possible `[MODE_NAME]()` workflows in `main.nf`. It passes along visualisation options, metrics data files, and report templates (`single` or `comparison`). Those reports are created with the `GENERATE_REPORT()` or `COMPARE_STRUCTURES()` `modules/local/` modules, respectively.
- `bin/generate_[comparison]_report.py`: takes the HTML templates at `assets/[report|comparison]_template.html` and populates them with plots created inside these Python scripts.
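To illustrate the kind of `--mode`-based dispatch suggested above, here is a minimal, hypothetical sketch. The extractor functions and the mode-to-format pairings are placeholders for illustration, not the script's actual interface.

```python
#!/usr/bin/env python
"""Hypothetical sketch of mode-aware dispatch for extract_metrics.py."""
import argparse
import json
import pickle
from pathlib import Path

import numpy as np


def extract_from_pkl(path: Path) -> dict:
    # AlphaFold2-style result pickles are serialised dictionaries.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def extract_from_json(path: Path) -> dict:
    with open(path) as handle:
        return json.load(handle)


def extract_from_npz(path: Path) -> dict:
    # .npz archives behave like a mapping of array names to arrays.
    return dict(np.load(path, allow_pickle=True))


# If two modes share a file format but lay out their metrics differently,
# keying the dispatch on --mode (rather than only on the file suffix) keeps
# the per-program logic separate. These pairings are placeholders.
EXTRACTORS = {
    "alphafold2": extract_from_pkl,
    "colabfold": extract_from_json,
    "esmfold": extract_from_npz,
}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", required=True, choices=sorted(EXTRACTORS))
    parser.add_argument("--input", required=True, type=Path)
    args = parser.parse_args()

    metrics = EXTRACTORS[args.mode](args.input)
    print(f"Extracted {len(metrics)} top-level keys for mode {args.mode}")


if __name__ == "__main__":
    main()
```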
## Process labelling
At the top of a module's `RUN_[MODE_NAME]` process, there is a series of labels that allow `nextflow.config` to route the job to the appropriate resources on the compute cluster. `label 'process_gpu'` is particularly useful for the AI inference stages, which require GPU-intensive computation. Other processes can use default labels that request CPU resources and, once finished, will naturally cascade onto the GPU-enabled steps due to Nextflow's dataflow paradigm.
## Processable structure prediction metrics
Metrics from AlphaFold-inspired protein structure prediction programs are structured in two ways: tabular, or as a matrix (the PAE values).

When contributing a new mode to proteinfold, functionality should be added to `extract_metrics.py` to access the canonical output files of the new program and extract the data into compliant `.tsv` files that can be easily processed by the downstream plotting and MultiQC functions.

Metrics files are 0-indexed.
### pLDDT (`{meta.id}_plddt.tsv`)
Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column (for all-atom modules, atomic token confidences are processed to a naive mean value across the residue).
```
Positions rank_0 rank_1 rank_2 rank_3 rank_4
0 83.58 85.27 88.41 86.22 84.91
1 97.99 97.81 97.39 97.49 97.32
2 98.22 98.42 98.16 97.88 97.81
3 98.06 98.15 97.94 97.56 97.4
4 98.67 98.56 98.3 98.38 98.29
5 98.81 98.77 98.62 98.61 98.54
6 98.79 98.74 98.57 98.59 98.52
...
```
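As a hedged illustration of the naive per-residue mean mentioned for all-atom modules, the sketch below averages placeholder per-atom confidences within each residue and writes a table in the layout above. The arrays and file name are invented for the example and are not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: per-atom pLDDT values for each ranked model, plus the
# residue index each atom belongs to (both placeholders for illustration).
atom_plddts = {
    "rank_0": np.array([83.1, 84.0, 97.5, 98.4]),
    "rank_1": np.array([85.0, 85.5, 97.6, 98.0]),
}
atom_to_residue = np.array([0, 0, 1, 1])  # atom i belongs to residue atom_to_residue[i]

records = {}
n_residues = atom_to_residue.max() + 1
for rank, values in atom_plddts.items():
    # Naive mean of the atomic confidences within each residue.
    sums = np.bincount(atom_to_residue, weights=values, minlength=n_residues)
    counts = np.bincount(atom_to_residue, minlength=n_residues)
    records[rank] = np.round(sums / counts, 2)

table = pd.DataFrame(records)
table.index.name = "Positions"  # 0-indexed residue positions
table.to_csv("example_plddt.tsv", sep="\t")  # placeholder output name
```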
### MSA (`{meta.id}_{meta.mode}_msa.tsv`)

The amino acid characters are converted to integers 0-19, unknown residues to 20, and the integer 21 represents the gap character.
```
19 5 5 4 10 16 15 3 8 15 13 16 12 9 17 16 9 4 8 11 0 7 7 8 11 0 19 8 8 5 3
19 5 5 4 10 16 15 3 8 15 13 16 12 9 17 16 9 4 8 11 0 7 7 8 11 0 6 8 8 5 13
19 5 5 4 10 5 15 13 14 0 14 16 12 9 17 16 9 4 14 11 0 7 5 8 15 4 5 8 3 5 21
19 5 5 4 10 16 15 3 8 15 13 16 12 9 17 16 9 4 8 11 0 7 7 8 11 0 19 8 8 5 21
19 5 5 4 10 16 15 3 8 15 13 16 12 9 17 16 9 4 8 11 0 7 7 8 11 0 19 8 8 5 13
19 5 5 4 10 16 15 3 8 15 13 16 12 9 7 16 9 4 8 11 0 7 7 8 11 0 6 8 8 5 13
```

This allows easy sequence identity calculation when processing the alignment as a numpy array.
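For instance, a minimal sketch of the kind of identity calculation this encoding enables (the file name is a placeholder; gap positions, value 21, could additionally be masked depending on the identity definition used):

```python
import numpy as np

# Load the integer-encoded MSA: each row is a sequence, each column an alignment position.
msa = np.loadtxt("example_msa.tsv", dtype=int)

# Fraction of positions in each row that match the query (first row).
query = msa[0]
identity = (msa == query).mean(axis=1)
print(identity)  # e.g. array([1.0, 0.93, ...])
```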
### (i)pTM (`{meta.id}_[i]ptm.tsv`)
(i)pTM scores, rounded to 3 decimal places, listed by the rank number (currently unsorted).
```
17 0.552
22 0.529
21 0.532
20 0.541
23 0.523
3 0.606
2 0.610
4 0.606
1 0.616
0 0.617
12 0.580
9 0.588
13 0.580
11 0.583
14 0.570
15 0.565
24 0.517
16 0.560
18 0.550
19 0.550
10 0.588
5 0.600
6 0.597
7 0.596
8 0.595
```
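Because the rows are not guaranteed to be sorted, a downstream consumer may want to order them by rank first. A minimal sketch, assuming a placeholder file name:

```python
import pandas as pd

# The file has no header row: the first column is the model rank and the
# second the (i)pTM score. sep=r"\s+" tolerates tab- or space-separated files.
scores = pd.read_csv(
    "example_iptm.tsv", sep=r"\s+", header=None, names=["rank", "iptm"]
)
scores = scores.sort_values("rank").reset_index(drop=True)
print(scores.head())
```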
### chain-wise (i)pTM (`{meta.id}_chainwise_[i]ptm.tsv`)

(Symmetrical) ipTM scores, rounded to 4 decimal places, with the chain pair lettering as the row label (`X:Y`) and the rank number as the column. A pTM value is a chain's own predicted Template Modelling score, so the lettering will be `X:X`.
```
    0      1      2
A:B 0.2880 0.2750 0.2900
B:A 0.2904 0.2801 0.2915
```

### PAE (`{meta.id}_{rank_number}_pae.tsv`)
Predicted aligned error from residue i to residue j, rounded to 4 decimal places. The row number gives the index of residue i and the column number gives the index of residue j in the 2D PAE matrix.

Each model prediction generates a separate file containing its rank number. The `_0_pae.tsv` file corresponds to the top-ranked model; the other ranked results are stored within the `./pae` folder.
```
0.2500 1.5710 3.9037 6.2177 8.4471 11.4583 12.9679 15.1237 18.0263 18.3868 18.9381 20.5747 19.3314 20.1825 21.6145 23.2190
2.2177 0.2500 1.5559 4.0327 6.3151 7.6372 10.1969 11.3626 14.9366 16.1303 17.9119 19.1877 21.2715 20.9531 20.1760 19.4087
3.4270 1.5284 0.2500 2.1333 3.5351 5.1049 6.6521 8.2317 12.1379 13.7185 14.9523 16.6154 19.6988 21.7614 18.6592 17.9619
6.1051 5.4206 2.5987 0.2500 2.0724 5.1454 6.7492 9.5538 9.6285 12.3868 13.8527 16.3586 17.2605 20.6381 19.9987 19.3295
7.3512 6.4947 5.5435 2.6740 0.2500 1.7561 4.9041 6.3923 8.9735 8.9272 12.3419 14.6005 15.9820 17.6358 20.5190 19.1028
7.4734 7.0899 5.8128 5.7512 2.0439 0.2500 1.8352 5.1064 6.4225 9.2098 10.5136 12.9404 14.3152 16.8122 18.6336 17.7382
```
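As a hedged example of consuming these files, the sketch below loads a per-rank PAE matrix with numpy and renders it as a heatmap. The file name and plot styling are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the top-ranked PAE matrix; pae[i, j] is the expected position error (Å)
# of residue j when the prediction and true structure are aligned on residue i.
pae = np.loadtxt("example_0_pae.tsv")

fig, ax = plt.subplots()
image = ax.imshow(pae, cmap="Greens_r", vmin=0)
ax.set_xlabel("Scored residue j")
ax.set_ylabel("Aligned residue i")
fig.colorbar(image, label="Expected position error (Å)")
fig.savefig("example_pae.png", dpi=150)
```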