---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---
# Semantic Text Deduplication Using SemHash
This Gradio application performs semantic deduplication on HuggingFace datasets using SemHash with Model2Vec embeddings.
## Features
**Two deduplication modes:**
- **Single dataset**: Find and remove duplicates within one dataset
- **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1

**Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)

**Detailed results**: View statistics and examples of found duplicates, with word-level differences highlighted

**Hub integration** 🆕: Push deduplicated datasets directly to the Hugging Face Hub after logging in
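The two modes can be sketched in pure Python, with toy bag-of-words vectors standing in for the Model2Vec embeddings the app actually uses (the function names below are illustrative, not SemHash's API):

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: a bag-of-words count vector (the app uses Model2Vec)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def self_deduplicate(texts, threshold=0.9):
    """Single-dataset mode: keep a text only if no already-kept text is too similar."""
    kept = []
    for t in texts:
        if all(cosine(embed(t), embed(k)) < threshold for k in kept):
            kept.append(t)
    return kept

def cross_deduplicate(reference, candidates, threshold=0.9):
    """Cross-dataset mode: drop candidates too similar to any reference text."""
    ref_vecs = [embed(r) for r in reference]
    return [c for c in candidates
            if all(cosine(embed(c), r) < threshold for r in ref_vecs)]
```

For example, `self_deduplicate(["the cat sat", "the cat sat", "dogs bark"])` keeps only one copy of the repeated sentence, while `cross_deduplicate` is what lets you strip test-set entries that also appear in a training set.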
## How to Use
### 1. Choose Deduplication Type
- **Cross-dataset**: Useful for removing training data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset
### 2. Configure Datasets
- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)
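Once a split is loaded, selecting the configured text column amounts to a lookup per row. A hedged sketch (`extract_texts` is a hypothetical helper, not the app's code; the rows mimic the dicts that `datasets.load_dataset` yields):

```python
# Hypothetical rows, shaped like records from a loaded dataset split.
rows = [
    {"text": "book a flight to boston", "label": 3},
    {"text": "play some jazz", "label": 7},
]

def extract_texts(rows, text_column="text"):
    """Pull the configured text column from each row, failing loudly if absent."""
    texts = []
    for i, row in enumerate(rows):
        if text_column not in row:
            raise KeyError(f"row {i} has no column {text_column!r}")
        texts.append(row[text_column])
    return texts
```

If the column name is wrong (say, the dataset uses `sentence` instead of `text`), the lookup fails immediately rather than silently deduplicating the wrong field.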
### 3. Set Similarity Threshold
- **0.9 (default)**: Good balance between precision and recall
- **Higher values (0.95-0.99)**: More conservative, only removes very similar texts
- **Lower values (0.7-0.85)**: More aggressive, may remove texts that are semantically similar but not identical
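To see how the threshold trades precision for recall, consider some invented similarity scores for three candidate pairs (`duplicates_at` and the scores are illustrative, not the app's output):

```python
# Invented similarity scores for three candidate pairs.
pairs = {
    ("exact copy", "exact copy"): 1.00,   # identical text
    ("minor rewording", "original"): 0.93, # paraphrase
    ("related topic", "original"): 0.78,   # same theme, different content
}

def duplicates_at(threshold):
    """Return the pairs that would be flagged as duplicates at this threshold."""
    return [pair for pair, sim in pairs.items() if sim >= threshold]
```

At 0.95 only the exact copy is flagged; at the default 0.9 the paraphrase is flagged too; at 0.7 even the merely related pair is removed, which is usually too aggressive.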
### 4. Run Deduplication
Click "Deduplicate" to start the process. You'll see:
- Loading progress for datasets
- Deduplication progress
- Results with statistics and example duplicates
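The reported statistics reduce to a few counts. A hypothetical sketch of how they might be computed (the field names are assumptions, not the app's actual output format):

```python
def dedup_stats(original_count, kept_count):
    """Summarize a deduplication run from before/after row counts."""
    removed = original_count - kept_count
    ratio = removed / original_count if original_count else 0.0
    return {
        "original": original_count,
        "kept": kept_count,
        "removed": removed,
        "duplicate_ratio": ratio,  # fraction of rows removed as duplicates
    }
```

For instance, a 1000-row dataset that keeps 870 rows has a duplicate ratio of 0.13.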
### 5. Push to Hub (New!)
After deduplication completes:
- Log in with your Hugging Face account using the login button
- Enter a dataset name for your cleaned dataset
- Click "Push to Hub" to upload the deduplicated dataset
The dataset will be saved as `your-username/dataset-name` and will be publicly available.
## Notes
- The app preserves all original columns from the datasets
- Only the text similarity is used for deduplication decisions
- Deduplicated datasets maintain the same structure as the original
- OAuth login is required only for pushing to the Hub, not for deduplication