Fastdedup: Rust-based dataset deduplication — benchmarks on FineWeb sample-10BT

Hey everyone,

I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.

I ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29 GB), all on a single Hetzner CCX43 instance.


Exact dedup vs DuckDB + SHA-256

| Metric | fastdedup | DuckDB |
|---|---|---|
| Wall clock (m:ss) | 2:55 | 7:55 |
| Peak RAM | 688 MB | 21.9 GB |
| CPU cores | 1 | 4+ |
| Records/sec | ~85,000 | ~31,000 |
| Duplicates removed | 51,392 | 51,392 |

That's 2.7x faster with ~32x less peak RAM, on a single core vs 4+, and the duplicate counts match exactly.
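For anyone unfamiliar with the approach: exact dedup here is just content hashing with first-wins semantics. Below is a minimal Rust sketch of that idea, not fastdedup's actual code; it assumes the `sha2` crate and newline-delimited records on stdin.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

use sha2::{Digest, Sha256};

/// Stream records (one per line), keep the first occurrence of each
/// SHA-256 digest, and write unique records to stdout.
fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = io::BufWriter::new(stdout.lock());

    // Storing 32-byte digests instead of full records keeps memory
    // proportional to the number of unique documents, not their total size.
    let mut seen: HashSet<[u8; 32]> = HashSet::new();

    for line in stdin.lock().lines() {
        let line = line?;
        let digest: [u8; 32] = Sha256::digest(line.as_bytes()).into();
        if seen.insert(digest) {
            writeln!(out, "{line}")?;
        }
    }
    Ok(())
}
```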


Fuzzy dedup (MinHash + LSH) vs datatrove

| Metric | fastdedup | datatrove |
|---|---|---|
| Wall clock | 36:44 | 3h50m+ (stage 1 only, terminated) |
| Peak RAM | 23 GB | 1.1 GB |
| Completed | Yes | No |
| Duplicates removed | 105,044 (0.7%) | n/a |

datatrove's stage 1 alone ran for 3h50m before I terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling; fastdedup shingles on character n-grams directly, which is significantly cheaper.
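To make the tokenization point concrete, here is a rough Rust sketch of character n-gram shingling feeding a MinHash signature. The constants (128 hash functions, 5-character shingles) and the seeded-std-hasher trick are illustrative assumptions, not fastdedup's actual parameters or hash functions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Signature length (number of simulated hash functions) -- illustrative.
const NUM_HASHES: usize = 128;
/// Character n-gram width -- illustrative, not fastdedup's default.
const NGRAM: usize = 5;

/// MinHash signature over character n-grams of `text`.
/// No tokenizer is involved: a window slides over the raw characters.
fn minhash_signature(text: &str) -> [u64; NUM_HASHES] {
    let chars: Vec<char> = text.chars().collect();
    let mut sig = [u64::MAX; NUM_HASHES];
    if chars.len() < NGRAM {
        return sig;
    }
    for window in chars.windows(NGRAM) {
        for (i, slot) in sig.iter_mut().enumerate() {
            // Simulate the i-th hash function by seeding with i.
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h);
            window.hash(&mut h);
            *slot = (*slot).min(h.finish());
        }
    }
    sig
}

fn main() {
    let a = minhash_signature("The quick brown fox jumps over the lazy dog.");
    let b = minhash_signature("The quick brown fox jumped over the lazy dog!");
    // Estimated Jaccard similarity = fraction of matching signature slots.
    let matches = a.iter().zip(&b).filter(|(x, y)| x == y).count();
    println!("estimated similarity: {:.2}", matches as f64 / NUM_HASHES as f64);
}
```

Because the shingles come straight off the character stream, there is no per-document tokenizer call at all.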

On the RAM trade-off: datatrove streams intermediate results to disk, which keeps RAM low at the cost of heavy I/O between stages, while fastdedup holds the LSH index in memory for speed. Different trade-offs, and worth being transparent about.
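For anyone unfamiliar with what "holding the LSH index in memory" means in practice, here is a simplified banding sketch. The band/row split and the all-pairs candidate expansion are illustrative, not the exact data structures fastdedup uses.

```rust
use std::collections::HashMap;

/// Illustrative LSH parameters: 128-slot signatures split into 16 bands of 8 rows.
const BANDS: usize = 16;
const ROWS: usize = 8;

/// Map each band of each signature to a bucket; documents sharing any bucket
/// become candidate duplicate pairs. The entire index lives in RAM here.
fn candidate_pairs(signatures: &[[u64; BANDS * ROWS]]) -> Vec<(usize, usize)> {
    let mut buckets: HashMap<(usize, Vec<u64>), Vec<usize>> = HashMap::new();
    for (doc_id, sig) in signatures.iter().enumerate() {
        for band in 0..BANDS {
            let key = (band, sig[band * ROWS..(band + 1) * ROWS].to_vec());
            buckets.entry(key).or_default().push(doc_id);
        }
    }
    let mut pairs = Vec::new();
    for docs in buckets.values() {
        for i in 0..docs.len() {
            for j in (i + 1)..docs.len() {
                pairs.push((docs[i], docs[j]));
            }
        }
    }
    pairs.sort_unstable();
    pairs.dedup();
    pairs
}

fn main() {
    let sigs = vec![[1u64; BANDS * ROWS], [1u64; BANDS * ROWS], [2u64; BANDS * ROWS]];
    println!("{:?}", candidate_pairs(&sigs)); // -> [(0, 1)]
}
```

An index like this grows with documents times bands, which is presumably what the tiered-storage roadmap item below would spill to disk.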


Caveats

  • Fuzzy dedup requires ~23 GB of RAM at this scale, so it is a cloud-instance workload, not a laptop workload

  • datatrove is designed for distributed execution, and tasks=1 is not its optimal configuration; this benchmark reflects how someone might run it locally on a single machine

  • Tiered storage to spill the LSH index to disk is on the roadmap


There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]

Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]

Happy to answer questions about the implementation or methodology.
