Hey everyone,
I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.
Ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29GB) on a single Hetzner CCX43 instance.
Exact dedup vs DuckDB + SHA-256
| Metric | fastdedup | DuckDB |
|---|---|---|
| Wall clock (mm:ss) | 2:55 | 7:55 |
| Peak RAM | 688 MB | 21.9 GB |
| CPU cores | 1 | 4+ |
| Records/sec | ~85,000 | ~31,000 |
| Duplicates removed | 51,392 | 51,392 |
2.7x faster, 32x less RAM, on a single core vs 4+. Duplicate counts match exactly.
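For a rough picture of why peak RAM can stay that low, here's a minimal streaming exact-dedup sketch in Rust. This is illustrative only, not fastdedup's actual implementation (the real hash function and record format aren't shown in this post); the point is that only a fixed-size hash per record is kept in memory, never the record text.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, BufWriter, Write};

// Stream newline-delimited records from stdin and emit only the first
// occurrence of each. Memory grows with the number of unique records,
// but only by a 64-bit hash per record, not the record text itself.
fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    let mut seen: HashSet<u64> = HashSet::new();

    for line in stdin.lock().lines() {
        let line = line?;
        // Hypothetical choice: std's 64-bit hasher for brevity. A production
        // pipeline would use a collision-resistant digest (e.g. SHA-256, as
        // the DuckDB baseline does) before trusting duplicate counts.
        let mut hasher = DefaultHasher::new();
        line.hash(&mut hasher);
        if seen.insert(hasher.finish()) {
            writeln!(out, "{}", line)?;
        }
    }
    Ok(())
}
```

In a scheme like this, peak RAM scales with the number of unique records rather than total input size, which is consistent with the sub-1 GB peak above.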
Fuzzy dedup (MinHash + LSH) vs datatrove
| Metric | fastdedup | datatrove |
|---|---|---|
| Wall clock | 36:44 (mm:ss) | 3h50m+ (stage 1 only, terminated) |
| Peak RAM | 23 GB | 1.1 GB |
| Completed | Y | N |
| Duplicates removed | 105,044 (0.7%) | — |
datatrove’s stage 1 alone ran for 3h50m before I terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling; fastdedup shingles on character n-grams directly, which is significantly cheaper.
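To make the tokenization point concrete, here's a minimal MinHash-over-character-n-grams sketch in Rust. The shingle size (5), signature length (128), and seeded std hasher below are illustrative assumptions, not fastdedup's actual parameters; the point is that shingling needs nothing more than a sliding window over the characters, with no tokenizer in the loop.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Build a MinHash signature from character n-gram shingles.
// No word tokenizer anywhere: shingling is just a sliding window of chars.
fn minhash_signature(text: &str, ngram: usize, num_hashes: usize) -> Vec<u64> {
    let chars: Vec<char> = text.chars().collect();
    let mut sig = vec![u64::MAX; num_hashes];

    for shingle in chars.windows(ngram) {
        for (i, slot) in sig.iter_mut().enumerate() {
            // Seeded std hasher as a cheap stand-in for a family of
            // independent hash functions (an assumption of this sketch).
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h);
            shingle.hash(&mut h);
            let v = h.finish();
            if v < *slot {
                *slot = v;
            }
        }
    }
    sig
}

// Estimated Jaccard similarity: fraction of signature positions that agree.
fn estimated_similarity(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

fn main() {
    let a = minhash_signature("The quick brown fox jumps over the lazy dog.", 5, 128);
    let b = minhash_signature("The quick brown fox jumped over a lazy dog.", 5, 128);
    println!("estimated similarity: {:.2}", estimated_similarity(&a, &b));
}
```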
On the RAM trade-off: datatrove streams to disk keeping RAM low at the cost of heavy I/O between stages. fastdedup holds the LSH index in memory for speed. Different trade-offs — worth being transparent about.
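For what "holding the LSH index in memory" looks like in practice, here's a rough banded-LSH sketch: split each signature into bands, hash each band, and keep one bucket map per band in RAM so candidate pairs fall out as bucket collisions. Band and row counts here are made-up illustration values, not fastdedup's configuration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// In-memory banded LSH index over MinHash signatures.
struct LshIndex {
    bands: usize,
    rows_per_band: usize,
    // One bucket map per band: band hash -> ids of documents sharing that band.
    buckets: Vec<HashMap<u64, Vec<usize>>>,
}

impl LshIndex {
    fn new(bands: usize, rows_per_band: usize) -> Self {
        Self { bands, rows_per_band, buckets: vec![HashMap::new(); bands] }
    }

    // Return ids of previously inserted documents that share at least one
    // band with `sig` (fuzzy-duplicate candidates), then insert `doc_id`.
    fn query_and_insert(&mut self, doc_id: usize, sig: &[u64]) -> Vec<usize> {
        let mut candidates = Vec::new();
        for band in 0..self.bands {
            let start = band * self.rows_per_band;
            let slice = &sig[start..start + self.rows_per_band];
            let mut h = DefaultHasher::new();
            slice.hash(&mut h);
            let entry = self.buckets[band].entry(h.finish()).or_default();
            candidates.extend(entry.iter().copied());
            entry.push(doc_id);
        }
        candidates.sort_unstable();
        candidates.dedup();
        candidates
    }
}

fn main() {
    // 8-value signatures, 4 bands of 2 rows (illustration values only).
    let mut index = LshIndex::new(4, 2);
    let sig_a = [1u64, 2, 3, 4, 5, 6, 7, 8];
    let sig_b = [9u64, 9, 9, 9, 9, 9, 9, 9];
    assert!(index.query_and_insert(0, &sig_a).is_empty());
    assert!(index.query_and_insert(1, &sig_b).is_empty());
    assert_eq!(index.query_and_insert(2, &sig_a), vec![0]);
    println!("doc 2 collides with doc 0 in at least one band");
}
```

In a layout like this, peak RAM tracks the number of indexed signatures, which is the trade-off the tiered-storage roadmap item below is meant to address.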
Caveats
- Fuzzy dedup requires ~23 GB of RAM at this scale: a cloud-instance workload, not a laptop workload
- datatrove is designed for distributed execution, and `tasks=1` is not its optimal configuration; this benchmark represents how someone might run it locally
- Tiered storage to spill the LSH index to disk is on the roadmap
There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]
Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]
Happy to answer questions about the implementation or methodology.