Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation#483
Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation#483HaidYi wants to merge 69 commits intonf-core:devfrom
Conversation
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
jasmezz
left a comment
There was a problem hiding this comment.
What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.
Some other comments we could consider:
- Thinking about renaming the new
dbcansubworkflow tocazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.- This would include changing the output dir in
modules.configto${params.outdir}/cazyme/cazyme_annotation,${params.outdir}/cazyme/cgc,${params.outdir}/cazyme/substrate - file tree in output docs
- test names
- nextflow_schema.json ...
- This would include changing the output dir in
- The database download takes very long because of low download rate (>2 GB at at rate of ~ 1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
- Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
conf/test_preannotated_dbcan.config
Outdated
| dbcan_skip_cgc = true // skip cgc as .gbk is used | ||
| dbcan_skip_substrate = true // skip substrate as .gbk is used |
There was a problem hiding this comment.
If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
docs/output.md
Outdated
| - `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation. | ||
| - `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation. | ||
| - `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation. | ||
| - `cgc` |
There was a problem hiding this comment.
Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).
There was a problem hiding this comment.
@jasmezz Thank you for reviewing the codes. I will revise it based on your comments.
jfy133
left a comment
There was a problem hiding this comment.
Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing
Some additional things to my direct comments:
- Missing
citations.mdupdate - Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
- Missing metromap update (but we can probably do this before release)
- Missing nf-test test and snapshots for the new tests
conf/test_preannotated_dbcan.config
Outdated
| run_bgc_screening = false | ||
| run_cazyme_screening = true | ||
|
|
||
| dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
We should probably add gff files!
You can generate them from a normal funcscan fun, and make a PR against teh funscan branch of nf-core/testdatasets, which has the files and an updated samplesheet for the next funcscan version
There was a problem hiding this comment.
Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.
There was a problem hiding this comment.
Can this be updated now you have the file?
docs/output.md
Outdated
| | ├── deepbgc/ | ||
| | ├── gecco/ | ||
| | └── hmmsearch/ | ||
| ├── dbcan/ |
There was a problem hiding this comment.
The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories
docs/output.md
Outdated
|
|
||
| - `dbcan/` | ||
| - `cazyme` | ||
| - `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation |
There was a problem hiding this comment.
You're missing the <sample.id> sample subdirectory underneath the tool name (accoeding to your modules.confg)
subworkflows/local/cazyme.nf
Outdated
| .join(ch_gffs_for_rundbcan) | ||
| .multiMap { meta, faa, gff -> | ||
| faa: [meta, faa] | ||
| gff: [meta, gff, 'prodigal'] |
There was a problem hiding this comment.
Is the gff always from prodigal? Or is this a dummy value?
There was a problem hiding this comment.
Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If it's the generated in the pipeline, it is always the prodigal. But if it's provided using the pre-annotated one, then it could be either NCBI_prok, JGI, NCBI_euk or prodigal. This makes things complicated. An easier way is to define a parameter in the cli for this option but it's kind of hard to deal with the mixed case in a batch without doing the modifications in the samplesheet.
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
|
@jfy133 Thank you for the comments and suggestions. I will fix all the problems one-by-one. As I don't want this PR corrupt other screening steps, I will do a more comprehensive testing, which may take more time. I will let you know when I fix all the issues. |
|
I hope you feel better soon @HaidYi ! |
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
|
@jfy133 Hi James, I am back for this PR. Following your comments, I fixed the problems left if I understand your comments correctly. I will ensure it can pass all the |
|
Hi @HaidYi - sorry for the delay, lots of end of year deadlines. Sounds good! So the error is a time out during the downloading of the dbCan dtabase: |
|
@jfy133 Thank you for pointing this. For this issue, I contacted the tool author. The problem is because the server has a low uploading bandwidth at UNL. So, the authors just got approved for hosting the data on AWS S3. So, they will update the Then, I think the time-out problem in the testing will be resolved automatically when I pull the newest module. I will keep you in the loop. |
|
Ok! Let's see if it helps 👍👍 |
|
@HaidYi I will check again after the holidays, but I just had a though it may also be an idea to ask the developer to make a mini database anyway . It may be useful for other cases too, of just needs to include a couple of gene sequences so there is something that is compatible with running db_can (even if output is nonsense). |
|
@jfy133 Happy new year! I hope you had a great holiday. Thanks to @Xinpeng021001 's work, the db_can tool has updated the database hosted from local server in the university to amazon s3 supported by AWS Open Data Sponsorship Program. And the tool has released the new version (v5.2.2) to reflect this change. So, next step we will update dbcan nf-core module and solve this slow database downloading problem in this PR as well. Will keep you posted for the progress. Thanks. |
|
Wonderful and than kyou @Xinpeng021001 ! Much appreciated! I'll keep an eye on this PR (just resovled a docs conflict just now) for updates :) |
|
@jfy133 I updated the rundbcan module to aws for database downloading(nf-core/modules#9768). And this new PR now has no problems for the longtime db downloading problems. Please review again. |
jfy133
left a comment
There was a problem hiding this comment.
OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!
Here are the last points/questions (to summarise some of the specific comments too), but otherwise code looks great, I've checked against our pipeline conventions (now on dev here and you're already following them already 💪:
Conceptual
- Can you confirm there are no
db_can <subcmd>options/arguments that we should expose to the user via a pipeline parameter? E.g. forrun_dbcanthe--modeor--methodsparameters? Or for thecgc_finderthe parameter--use_distance?
Code
test_preannotated_cazyme.conf: You are missing atestsnf-test test file and it's snapshot for the new test config
Documentation
usage.md: missing documentation in the sameplsheet section about the newgffcolumnnextflow_schema.json: missing the long-form helptext(s) describing when you would want to maybe skip thecgcandsubstratedetectionCHANGELOG.md: missing a change log entry of the PR, but also please make sure to add the version ofdb_canas a new dependency (i.e., the previous version column can be empty)README.md: don't forget to add yourself to the 'credits` list!nextflow.config: don't forget to add yourself to the manifest section as acontributor!
| dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet | ||
| dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue maybe it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet
But it would be nice in another test profile (maybe test_cazyme_prokka) you still also test skipping the dbcan_skip_cgc and dbcan_skip_substrate functionality?
| }, | ||
| "dbcan_skip_cgc": { | ||
| "type": "boolean", | ||
| "description": "Skip CGC during the dbCAN screening.", |
| @@ -0,0 +1,37 @@ | |||
| /* | |||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
| Nextflow config file for running minimal tests | |||
There was a problem hiding this comment.
This file is still missing a tests/test.nf.test file and associated snapshot
There was a problem hiding this comment.
As you've added a new optional column to the samplesheet, you need to add a description on this near the top of this page in the relevant section)
PR checklist
Close #481.
The main changes include:
subworkflows/dbcan.nf) for the support of run_dbcan screening..gfffiles and added the alias of the current modules (e.g.,PYRODIGAL_GFF). So, the inputgbkcolumn may also usegfffile as input. Feel free to change this part as it may need some tweaks considering the both the pipeline and the document.module.config, etc.readmeandoutputThings that are needed the changes from the maintainer:
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).