dada2-rs

An experimental implementation of DADA2 in Rust, using Claude Code (specically Sonnet 4.6 and Opus 4.6/4.7) for the bulk of the work.

Implementations

Rust ports of:
- Filter/trimming FASTQ (filterAndTrim)
- Dereplication (derepFastq)
- Error model (learnErrors)
- Denoising (dada)
- Merging (mergePairs)
- Current error models (loessErrfun, PacBioErrfun, noqualErrfun, makeBinnedErrfun) - only loessErrfun and PacBioErrfun have been tested
- Chimera removal (removeBimeraDenovo) - needs testing and optimization
- RDP taxonomic classifier (assignTaxonomy) + (assignSpecies) - needs testing and optimization
- Other intermediate functions (mergeTables)
Other functionality:
- Intermediate outputs (in JSON) - can be evaluated for debugging purposes or for plotting in R, Python, etc.
- Add basic regression tests for each stage that follow those within the original DADA2 repository
- Custom error models in R and/or Python - experimental, basic tests in place
- Docker builds available

Latest Benchmarks

The implementation is roughly 4.5x faster than the R DADA2 implementation (not including R loading overhead), and uses about 1/8 the memory. See also the Project Background for how this started and how we intend to move forward on this implementation.

Illumina MiSeq (MiSeq SOP data)

Per Issue #4. Tested on MacBook Pro (M1)

Threads	Rust real	Rust user	Rust RSS	R real	R user	R RSS	Speedup (real)	RSS ratio
1	173.71s	168.69s	262 MB	785.02s	765.68s	1222 MB	4.5×	4.7× less
4	47.66s	183.10s	283 MB	220.07s	783.07s	1413 MB	4.6×	5.0× less
8	33.52s	208.45s	276 MB	153.43s	922.67s	1404 MB	4.6×	5.1× less

Illumina MiSeq (i100 binned quality scores)

TODO, need example set

PacBio HiFi (standard, full quality scores)

TODO, need example set

PacBio HiFi (Kinnex, binned quality scores)

TODO, need example set

Building

Requires a recent stable Rust toolchain (rustup.rs).

cargo build --release
# binary at target/release/dada2-rs

Subcommands

Subcommand	Description
`filter-and-trim`	Filter and trim FASTQ reads (mirrors `filterAndTrim`)
`derep`	Dereplicate a FASTQ file
`sample`	Dereplicate and subsample FASTQ files, one JSON per sample
`errors-from-sample`	Learn error model from derep JSON files
`learn-errors`	Learn error model directly from FASTQ files
`dada`	Denoise a sample using a learned error model
`merge-pairs`	Merge denoised forward and reverse reads
`make-sequence-table`	Build a sample × sequence count table
`remove-bimera-denovo`	Remove chimeric sequences
`summary`	Per-position quality metrics from a FASTQ

Run dada2-rs <subcommand> --help for full parameter documentation.

Typical MiSeq workflow

The steps below follow the DADA2 MiSeq SOP. All intermediate outputs are JSON and can be inspected or plotted independently.

1. Filter and trim

dada2-rs filter-and-trim \
  --fwd  raw/sample_R1.fastq.gz --filt filtered/sample_R1.fastq.gz \
  --rev  raw/sample_R2.fastq.gz --filt-rev filtered/sample_R2.fastq.gz \
  --trunc-len 240 160 \
  --max-n 0 --max-ee 2 2 --trunc-q 2 \
  --compress --verbose

2. Learn the error model

Option A — from pre-computed derep JSON files (via sample subcommand):

# Dereplicate and subsample forward reads across all samples
dada2-rs sample filtered/*_R1.fastq.gz \
  --output-dir sample_json/ --nbases 100000000 --verbose

# Learn error model
dada2-rs errors-from-sample sample_json/*.json \
  --errfun loess -o errors_fwd.json --verbose

Option B — directly from FASTQ (single command):

dada2-rs learn-errors filtered/*_R1.fastq.gz \
  --nbases 100000000 --errfun loess \
  -o errors_fwd.json --verbose

Visualise cluster diagnostics during error learning

Pass --diag-dir to emit a iter_NNN.json file for each self-consistency iteration, then plot with the bundled R script:

dada2-rs errors-from-sample sample_json/*.json \
  --errfun loess --diag-dir diag_fwd/ -o errors_fwd.json --verbose

Rscript comparison/plot_cluster_diag.R diag_fwd/ diag_fwd/cluster_diag.pdf

The plot shows cluster counts, birth-type breakdown, convergence trace, and alignment work across iterations.

3. Denoise each sample

dada2-rs dada filtered/sample_R1.fastq.gz \
  --error-model errors_fwd.json --show-map \
  -o dada/sample_R1.json --verbose

Repeat for reverse reads using the reverse error model.

4. Merge paired reads

dada2-rs merge-pairs \
  --fwd-dada dada/fwd/*.json \
  --rev-dada dada/rev/*.json \
  --fwd-fastq filtered/fwd/*.fastq.gz \
  --rev-fastq filtered/rev/*.fastq.gz \
  -o merged.json --verbose

5. Build sequence table and remove chimeras

dada2-rs make-sequence-table merged.json -o seqtab.json
dada2-rs remove-bimera-denovo seqtab.json --method consensus -o seqtab_nochim.json

Visualise the error model

Rscript plot_errors.R errors_fwd.json errors_fwd.pdf

Comparing with R DADA2

The comparison/ directory contains scripts for validating output against the reference R implementation on the MiSeq SOP dataset:

# Run Rust pipeline (filter → derep → error model)
bash comparison/run_rust_errors.sh

# Compare error matrices against R's learnErrors()
Rscript comparison/compare_errors.R \
  comparison/comparison_out/F3D0_S188_L001_R1_001_filtered.fastq.gz \
  comparison/comparison_out/F3D0_S188_L001_R1_001_errors_rust.json

Requires R packages: dada2, jsonlite, ggplot2, gridExtra.

AI Assistance Disclosure

This tool was written with the assistance of AI coding agents, specificall Claude Code, using Sonnet 4.6. All commits using AI are noted.

Correctness is validated by comparing output against DADA2 v1.36 on a suite of real sequencing datasets - not by manual code review alone. AI generated the implementation; humans defined the validation criteria, made some key coding updates, and verified results.

Project background

The original purpose of the project was exploratory: walk through the steps in the DADA2 workflow to understand the underlying implementation for each key command, initially to replicate results from the R DADA2 workflow but to also explore potential paths for improving the implementation, as we use this quite extensively in our own research.

This is now at an interesting stage, as the current implementation up to the 'learn errors' error model step, closely matches results from DADA2 but runs about 5x faster with 8x less memory.

Learning

This was also meant to be a learning opportunity on several fronts. I have been learning Rust over the last few years in my (vanishingly small) spare time, and actually planned a general port of DADA2 a few years back. I had programmed various languages over the years (Perl, Python, R, C++, and a little Java), but apart from R and Python I'm pretty, um, rusty,

AI

I have noticed a dramatic improvement in coding-based AI tools and agents, also noted by many others in the field. However there has been a lot of discussion on the best approaches to reimplementation strategies involving these approaches, and many controversies along the way.

I've long been involved in open-source development and open-science projects, and I'm also the leader of a bioinformatics core group. I understand the community responses to this as well as the potential benefits to this community, and I believe community standards are needed that ensure we follow some general guidelines that ensure some degree of consistency and support, provenance that ensure the original implementors retain clear credit for their work, and where community members can play an active role.

Guidelines

AI is having a clear, fundamental, and disruptive impact, and simply ignoring this is to one's detriment. Some of us are also mentors, and as such we have an obligation to understand this wildly changing landscape and help prepare students, postdocs, and scientist on how to best use AI for their future career path. However many controveries exist over the use of these tools, in some cases arguably crossing moral and ethical boundaries. Standards are sorely needed.

Thankfully, within the bioinformatics community these are starting to coalesce, for example rewrites.bio. Therefore, this project will follow rewrites.bio guidelines as closely as possible to:

Ensure the original DADA2 developers and contributors are acknowledged,
Follows the original implementation's details,
Include tests and benchmarks for the work,
Utilize consistent libraries where possible,
Release as open source and follow the original licensing
Should interest arise: develop a community that can contribute.

Caveat

One point where this implementation will vary from the rewrites.bio standards: due to some key implementation details (conversion of R/C++ to Rust including error models), results will vary slightly. However we will strive to reproduce results as closely as possible. We have added the ability to use R and Python for custom error model analysis to more closely emulate what we see from the original implementation.

We also do not want to prevent additional outcomes or functionality that may come from the work in this project by being constrained to emulating the original code. For example, one interesting side benefit for PacBio HiFi reads has come from exposing the k-mer size as an option: alternative k-mer lengths appear to improve performance for PacBio denoising; this is something that needs to be explored more but could result in a substantial improvement in processing.

Name		Name	Last commit message	Last commit date
Latest commit History 131 Commits
.githooks		.githooks
.github/workflows		.github/workflows
comparison		comparison
examples		examples
scripts		scripts
src		src
testdata		testdata
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
build.rs		build.rs
plot_errors.R		plot_errors.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dada2-rs

Implementations

Latest Benchmarks

Illumina MiSeq (MiSeq SOP data)

Illumina MiSeq (i100 binned quality scores)

PacBio HiFi (standard, full quality scores)

PacBio HiFi (Kinnex, binned quality scores)

Building

Subcommands

Typical MiSeq workflow

1. Filter and trim

2. Learn the error model

Visualise cluster diagnostics during error learning

3. Denoise each sample

4. Merge paired reads

5. Build sequence table and remove chimeras

Visualise the error model

Comparing with R DADA2

AI Assistance Disclosure

Project background

Learning

AI

Guidelines

Caveat

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

dada2-rs

Implementations

Latest Benchmarks

Illumina MiSeq (MiSeq SOP data)

Illumina MiSeq (i100 binned quality scores)

PacBio HiFi (standard, full quality scores)

PacBio HiFi (Kinnex, binned quality scores)

Building

Subcommands

Typical MiSeq workflow

1. Filter and trim

2. Learn the error model

Visualise cluster diagnostics during error learning

3. Denoise each sample

4. Merge paired reads

5. Build sequence table and remove chimeras

Visualise the error model

Comparing with R DADA2

AI Assistance Disclosure

Project background

Learning

AI

Guidelines

Caveat

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages