
Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation#483

Open
HaidYi wants to merge 69 commits into nf-core:dev from HaidYi:rundbcan

Conversation


@HaidYi HaidYi commented Jul 2, 2025

PR checklist

Closes #481.

The main changes include:

  • Like the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
  • Added an annotation step for generating the .gff files and added aliases of the current modules (e.g., PYRODIGAL_GFF), so the gbk input column may also take a .gff file. Feel free to change this part, as it may need some tweaks to both the pipeline and the documentation.
  • Other utilities:
    • CI/CD, test profiles for dbcan, modules.config, etc.
    • documentation: README and output docs

Changes needed from the maintainers:

  • Add a changelog entry for this change in the next release version.
  • Add the dbcan screening step to the workflow schematic.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi HaidYi self-assigned this Jul 2, 2025
@HaidYi HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.


@jasmezz jasmezz left a comment


What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
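For illustration, the renamed output directories could be wired up in modules.config roughly as below. This is only a sketch: the process selector names (RUNDBCAN_CAZYMEANNOTATION, RUNDBCAN_EASYSUBSTRATE) are assumptions; only the rundbcan_easycgc module is referenced elsewhere in this thread.

```groovy
// Sketch only: publishDir entries after renaming the output directory
// from "dbcan" to "cazyme". Process names are illustrative.
process {
    withName: 'RUNDBCAN_CAZYMEANNOTATION' {
        publishDir = [
            path: { "${params.outdir}/cazyme/cazyme_annotation" },
            mode: params.publish_dir_mode,
        ]
    }
    withName: 'RUNDBCAN_EASYCGC' {
        publishDir = [
            path: { "${params.outdir}/cazyme/cgc" },
            mode: params.publish_dir_mode,
        ]
    }
    withName: 'RUNDBCAN_EASYSUBSTRATE' {
        publishDir = [
            path: { "${params.outdir}/cazyme/substrate" },
            mode: params.publish_dir_mode,
        ]
    }
}
```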

Comment on lines 35 to 36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Collaborator

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
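As an illustration, such an extended pre-annotated samplesheet could look like the sketch below (the `gff` column itself and the exact column names are assumptions, not the pipeline's current schema):

```csv
sample,fasta,protein,gbk,gff
sample_1,sample_1.fasta,sample_1.faa,sample_1.gbk,sample_1.gff
sample_2,sample_2.fasta,sample_2.faa,sample_2.gbk,sample_2.gff
```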

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Collaborator

Many of the files in the cgc and substrate sections seem duplicated. Maybe we don't need to store those that are already created in the cazyme step? We can control this in modules.config (e.g. see the RGI_MAIN entry).
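A sketch of that approach (the selector and file pattern are illustrative, not the actual module output names):

```groovy
// Sketch: restrict what the CGC step publishes so files already
// written by the cazyme annotation step are not duplicated.
process {
    withName: 'RUNDBCAN_EASYCGC' {
        publishDir = [
            path: { "${params.outdir}/dbcan/cgc" },
            mode: params.publish_dir_mode,
            // publish only CGC-specific outputs; skip shared intermediates
            pattern: '*cgc*',
        ]
    }
}
```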

Author

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.


@jfy133 jfy133 left a comment


Really good first PR @HaidYi! Clean, and pretty much all of my comments are minor/just polishing.

Some additional things to my direct comments:

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which has the files and an updated samplesheet for the next funcscan version.

Author

Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use pre-annotated input, I generated the .gff files with pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Member

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Member

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Member

You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config)

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
faa: [meta, faa]
gff: [meta, gff, 'prodigal']
Member

Is the gff always from prodigal? Or is this a dummy value?

Author

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the .gff is generated within the pipeline, it is always prodigal. But if a pre-annotated one is provided, it could be NCBI_prok, JGI, NCBI_euk, or prodigal, which complicates things. An easier way would be to define a CLI parameter for this option, but it is hard to handle the mixed case in a batch without modifying the samplesheet.
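One hedged way to handle the mixed case without a global CLI parameter would be to carry the annotation source per sample in the meta map. In this sketch the `gff_source` meta key and the upstream channel name `ch_faas_for_rundbcan` are assumptions, not part of the current code:

```groovy
// Sketch: use a per-sample annotation source from meta, falling back
// to 'prodigal' for GFFs generated inside the pipeline. Valid values
// per the rundbcan_easycgc docs: NCBI_prok, JGI, NCBI_euk, prodigal.
ch_faas_for_rundbcan
    .join(ch_gffs_for_rundbcan)
    .multiMap { meta, faa, gff ->
        faa: [meta, faa]
        gff: [meta, gff, meta.gff_source ?: 'prodigal']
    }
```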

HaidYi and others added 4 commits July 16, 2025 19:24
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break the other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.


HaidYi commented Oct 16, 2025

> I see you've not got to this yet @HaidYi, will check again next week :) (no rush though!)

@jfy133 Sure, I will get back to working on it this weekend. A few trips since last weekend, and I've been sick this week.


jfy133 commented Oct 16, 2025

I hope you feel better soon @HaidYi !

jfy133 and others added 8 commits November 5, 2025 12:14
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

HaidYi commented Nov 24, 2025

@jfy133 Hi James, I am back on this PR. Following your comments, I fixed the remaining problems, if I understood your comments correctly. I will make sure it passes all the CI/CD test cases; then you can review it.


jfy133 commented Dec 8, 2025

Hi @HaidYi - sorry for the delay, lots of end of year deadlines.

Sounds good!

So the error is a timeout during the download of the dbCAN database:

ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_DATABASE'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   run_dbcan database \
    >       --db_dir dbcan_db
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_DATABASE":
    >       dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   (empty)
    > 
    > Command error:
    >   Downloading dbCAN-sub.hmm:  81%|████████  | 1.99G/2.46G [27:30<06:37, 1.19MiB/s]
    >   [... ~50 similar progress lines omitted ...]
    >   Downloading dbCAN-sub.hmm:  81%|████████  | 2.00G/2.46G [27:35<06:28, 1.20MiB/s]
    > 
    > Work dir:
    >   /home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/work/83/1bba6f4132d4df56d5262258836271
    > 
    > Container:
    >   quay.io/biocontainers/dbcan:5.1.2--pyhdfd78af_0
    > 
    > Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/d7fb331664515ac8c0840002642ad00c/meta/nextflow.log' file for details
    > -[nf-core/funcscan] Pipeline completed with errors-
Normally our preferred workaround is to make a very small version of the database that we host with our test data (e.g. just a few genes/taxa), do you think that would be possible to make?
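Until a smaller database exists, a stopgap for local runs (though not a real fix for CI wall-clock limits) would be to raise the time limit of the download process, whose name appears in the error above:

```groovy
// Sketch: extend the default 1 h limit for the dbCAN database download.
process {
    withName: 'RUNDBCAN_DATABASE' {
        time = 6.h
    }
}
```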


HaidYi commented Dec 12, 2025

@jfy133 Thank you for pointing this out. For this issue I contacted the tool authors. The problem is that the server at UNL has low upload bandwidth. The authors have now been approved to host the data on AWS S3, and they will update the nf-core rundbcan_database module when they finish migrating the database to S3.

I think the timeout in the tests will then be resolved automatically once I pull the newest module. I will keep you in the loop.


jfy133 commented Dec 12, 2025

Ok! Let's see if it helps 👍👍


jfy133 commented Dec 17, 2025

@HaidYi I will check again after the holidays, but I just had a thought: it may also be an idea to ask the developer to make a mini database anyway. It could be useful for other cases too; it just needs to include a couple of gene sequences so there is something compatible with running db_can (even if the output is nonsense).


HaidYi commented Jan 7, 2026

@jfy133 Happy new year! I hope you had a great holiday. Thanks to @Xinpeng021001's work, the db_can tool has moved its database from a local university server to Amazon S3, supported by the AWS Open Data Sponsorship Program, and has released a new version (v5.2.2) to reflect this change.

So next we will update the dbcan nf-core module and solve the slow database download problem in this PR as well. Will keep you posted on progress. Thanks.


jfy133 commented Jan 7, 2026

Wonderful, and thank you @Xinpeng021001! Much appreciated!

I'll keep an eye on this PR (just resolved a docs conflict now) for updates :)


HaidYi commented Feb 4, 2026

@jfy133 I updated the rundbcan module to download the database from AWS (nf-core/modules#9768). This PR no longer has the long database download problem. Please review again.

@HaidYi HaidYi requested a review from jfy133 February 4, 2026 16:04

@jfy133 jfy133 left a comment


OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!

Here are the last points/questions (to summarise some of the specific comments too), but otherwise the code looks great. I've checked against our pipeline conventions (now on dev here), and you're already following them 💪.

Conceptual

  1. Can you confirm there are no run_dbcan <subcmd> options/arguments that we should expose to the user via a pipeline parameter? E.g. for run_dbcan the --mode or --methods parameters, or for the cgc_finder the --use_distance parameter?
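If any of these should be exposed, the usual nf-core pattern would be a pipeline parameter forwarded via ext.args in modules.config. A sketch, where the parameter name `dbcan_cgc_use_distance` is hypothetical (only the --use_distance flag is mentioned above):

```groovy
// Sketch: forward an optional dbCAN flag through ext.args.
process {
    withName: 'RUNDBCAN_EASYCGC' {
        ext.args = params.dbcan_cgc_use_distance ? '--use_distance' : ''
    }
}
```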

Code

  1. test_preannotated_cazyme.conf: You are missing an nf-test test file (under tests/) and its snapshot for the new test config

Documentation

  1. usage.md: missing documentation in the samplesheet section about the new gff column
  2. nextflow_schema.json: missing the long-form helptext(s) describing when you might want to skip the cgc and substrate detection
  3. CHANGELOG.md: missing a changelog entry for the PR; please also make sure to add the version of db_can as a new dependency (i.e., the previous-version column can be empty)
  4. README.md: don't forget to add yourself to the 'credits' list!
  5. nextflow.config: don't forget to add yourself to the manifest section as a contributor!

Comment on lines +35 to +36
dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet
Member

Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet.

But it would be nice if, in another test profile (maybe test_cazyme_prokka), you also tested the dbcan_skip_cgc and dbcan_skip_substrate skip functionality?
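A sketch of what such an extra test profile could set (the profile name and surrounding test params are assumptions; input and database settings are omitted):

```groovy
// Sketch of a conf/test_cazyme_prokka.config fragment exercising the
// skip flags; other required test parameters not shown.
params {
    run_cazyme_screening = true
    dbcan_skip_cgc       = true
    dbcan_skip_substrate = true
}
```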

},
"dbcan_skip_cgc": {
"type": "boolean",
"description": "Skip CGC during the dbCAN screening.",
Member

Still missing

@@ -0,0 +1,37 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
Member

This file is still missing a tests/test.nf.test file and associated snapshot

Member

As you've added a new optional column to the samplesheet, you need to add a description of it near the top of this page, in the relevant section.

4 participants