Hey everyone,
I’ve been working on a Rust CLI for dataset deduplication called fastdedup and wanted to share some benchmark results since dataset prep tooling comes up a lot here.
Ran both exact and fuzzy dedup against standard Python baselines on FineWeb sample-10BT (14.8M records, 29GB) on a single Hetzner CCX43 instance.
Exact dedup vs DuckDB + SHA-256
| Metric | fastdedup | DuckDB |
|---|---|---|
| Wall clock (mm:ss) | 2:55 | 7:55 |
| Peak RAM | 688 MB | 21.9 GB |
| CPU cores | 1 | 4+ |
| Records/sec | ~85,000 | ~31,000 |
| Duplicates removed | 51,392 | 51,392 |
2.7x faster, 32x less RAM, on a single core vs 4+. Duplicate counts match exactly.
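For a rough picture of why peak RAM can stay that low, here's a minimal streaming exact-dedup sketch in Rust. This is illustrative only, not fastdedup's actual implementation (the real hash function and record format aren't shown in this post); the point is that only a fixed-size hash per record is kept in memory, never the record text.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead, BufWriter, Write};

// Stream newline-delimited records from stdin and emit only the first
// occurrence of each. Memory grows with the number of unique records,
// but only by a 64-bit hash per record, not the record text itself.
fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    let mut seen: HashSet<u64> = HashSet::new();

    for line in stdin.lock().lines() {
        let line = line?;
        // Hypothetical choice: std's 64-bit hasher for brevity. A production
        // pipeline would use a collision-resistant digest (e.g. SHA-256, as
        // the DuckDB baseline does) before trusting duplicate counts.
        let mut hasher = DefaultHasher::new();
        line.hash(&mut hasher);
        if seen.insert(hasher.finish()) {
            writeln!(out, "{}", line)?;
        }
    }
    Ok(())
}
```

In a scheme like this, peak RAM scales with the number of unique records rather than total input size, which is consistent with the sub-1 GB peak above.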
Fuzzy dedup (MinHash + LSH) vs datatrove
| Metric | fastdedup | datatrove |
|---|---|---|
| Wall clock | 36:44 (mm:ss) | 3h50m+ (stage 1 only, terminated) |
| Peak RAM | 23 GB | 1.1 GB |
| Completed | Y | N |
| Duplicates removed | 105,044 (0.7%) | — |
datatrove’s stage 1 alone ran for 3h50m before I terminated it. The bottleneck turned out to be spaCy word tokenization on every document before shingling; fastdedup shingles on character n-grams directly, which is significantly cheaper.
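To make the tokenization point concrete, here's a minimal MinHash-over-character-n-grams sketch in Rust. The shingle size (5), signature length (128), and seeded std hasher below are illustrative assumptions, not fastdedup's actual parameters; the point is that shingling needs nothing more than a sliding window over the characters, with no tokenizer in the loop.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Build a MinHash signature from character n-gram shingles.
// No word tokenizer anywhere: shingling is just a sliding window of chars.
fn minhash_signature(text: &str, ngram: usize, num_hashes: usize) -> Vec<u64> {
    let chars: Vec<char> = text.chars().collect();
    let mut sig = vec![u64::MAX; num_hashes];

    for shingle in chars.windows(ngram) {
        for (i, slot) in sig.iter_mut().enumerate() {
            // Seeded std hasher as a cheap stand-in for a family of
            // independent hash functions (an assumption of this sketch).
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h);
            shingle.hash(&mut h);
            let v = h.finish();
            if v < *slot {
                *slot = v;
            }
        }
    }
    sig
}

// Estimated Jaccard similarity: fraction of signature positions that agree.
fn estimated_similarity(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

fn main() {
    let a = minhash_signature("The quick brown fox jumps over the lazy dog.", 5, 128);
    let b = minhash_signature("The quick brown fox jumped over a lazy dog.", 5, 128);
    println!("estimated similarity: {:.2}", estimated_similarity(&a, &b));
}
```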
On the RAM trade-off: datatrove streams to disk keeping RAM low at the cost of heavy I/O between stages. fastdedup holds the LSH index in memory for speed. Different trade-offs — worth being transparent about.
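For what "holding the LSH index in memory" looks like in practice, here's a rough banded-LSH sketch: split each signature into bands, hash each band, and keep one bucket map per band in RAM so candidate pairs fall out as bucket collisions. Band and row counts here are made-up illustration values, not fastdedup's configuration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// In-memory banded LSH index over MinHash signatures.
struct LshIndex {
    bands: usize,
    rows_per_band: usize,
    // One bucket map per band: band hash -> ids of documents sharing that band.
    buckets: Vec<HashMap<u64, Vec<usize>>>,
}

impl LshIndex {
    fn new(bands: usize, rows_per_band: usize) -> Self {
        Self { bands, rows_per_band, buckets: vec![HashMap::new(); bands] }
    }

    // Return ids of previously inserted documents that share at least one
    // band with `sig` (fuzzy-duplicate candidates), then insert `doc_id`.
    fn query_and_insert(&mut self, doc_id: usize, sig: &[u64]) -> Vec<usize> {
        let mut candidates = Vec::new();
        for band in 0..self.bands {
            let start = band * self.rows_per_band;
            let slice = &sig[start..start + self.rows_per_band];
            let mut h = DefaultHasher::new();
            slice.hash(&mut h);
            let entry = self.buckets[band].entry(h.finish()).or_default();
            candidates.extend(entry.iter().copied());
            entry.push(doc_id);
        }
        candidates.sort_unstable();
        candidates.dedup();
        candidates
    }
}

fn main() {
    // 8-value signatures, 4 bands of 2 rows (illustration values only).
    let mut index = LshIndex::new(4, 2);
    let sig_a = [1u64, 2, 3, 4, 5, 6, 7, 8];
    let sig_b = [9u64, 9, 9, 9, 9, 9, 9, 9];
    assert!(index.query_and_insert(0, &sig_a).is_empty());
    assert!(index.query_and_insert(1, &sig_b).is_empty());
    assert_eq!(index.query_and_insert(2, &sig_a), vec![0]);
    println!("doc 2 collides with doc 0 in at least one band");
}
```

In a layout like this, peak RAM tracks the number of indexed signatures, which is the trade-off the tiered-storage roadmap item below is meant to address.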
Caveats
- Fuzzy dedup requires ~23 GB of RAM at this scale: a cloud-instance workload, not a laptop workload
- datatrove is designed for distributed execution, and `tasks=1` is not its optimal configuration; this benchmark represents how someone might run it locally
- Tiered storage to spill the LSH index to disk is on the roadmap
There’s a small Gradio demo on HF Spaces if you want to test on a small file: [Spaces Link]
Full benchmark methodology, scripts, and raw results are in the repo: [GitHub link]
Happy to answer questions about the implementation or methodology.