---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---
# Semantic Text Deduplication Using SemHash
This Gradio application performs semantic deduplication on HuggingFace datasets using SemHash with Model2Vec embeddings.
## Features
**Two deduplication modes:**
- **Single dataset**: Find and remove duplicates within one dataset
- **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1

**Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)

**Detailed results**: View statistics and examples of found duplicates, with word-level differences highlighted

**Hub integration** 🆕: Push deduplicated datasets directly to the Hugging Face Hub after logging in
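The two modes can be sketched in pure Python, with toy bag-of-words vectors standing in for the Model2Vec embeddings the app actually uses (the function names below are illustrative, not SemHash's API):

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: a bag-of-words count vector (the app uses Model2Vec)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def self_deduplicate(texts, threshold=0.9):
    """Single-dataset mode: keep a text only if no already-kept text is too similar."""
    kept = []
    for t in texts:
        if all(cosine(embed(t), embed(k)) < threshold for k in kept):
            kept.append(t)
    return kept

def cross_deduplicate(reference, candidates, threshold=0.9):
    """Cross-dataset mode: drop candidates too similar to any reference text."""
    ref_vecs = [embed(r) for r in reference]
    return [c for c in candidates
            if all(cosine(embed(c), r) < threshold for r in ref_vecs)]
```

For example, `self_deduplicate(["the cat sat", "the cat sat", "dogs bark"])` keeps only one copy of the repeated sentence, while `cross_deduplicate` is what lets you strip test-set entries that also appear in a training set.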
## How to Use
### 1. Choose Deduplication Type
- **Cross-dataset**: Useful for removing training data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset
### 2. Configure Datasets
- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)
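Once a split is loaded, selecting the configured text column amounts to a lookup per row. A hedged sketch (`extract_texts` is a hypothetical helper, not the app's code; the rows mimic the dicts that `datasets.load_dataset` yields):

```python
# Hypothetical rows, shaped like records from a loaded dataset split.
rows = [
    {"text": "book a flight to boston", "label": 3},
    {"text": "play some jazz", "label": 7},
]

def extract_texts(rows, text_column="text"):
    """Pull the configured text column from each row, failing loudly if absent."""
    texts = []
    for i, row in enumerate(rows):
        if text_column not in row:
            raise KeyError(f"row {i} has no column {text_column!r}")
        texts.append(row[text_column])
    return texts
```

If the column name is wrong (say, the dataset uses `sentence` instead of `text`), the lookup fails immediately rather than silently deduplicating the wrong field.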
### 3. Set Similarity Threshold
- **0.9 (default)**: Good balance between precision and recall
- **Higher values (0.95-0.99)**: More conservative, only removes very similar texts
- **Lower values (0.7-0.85)**: More aggressive, may remove texts that are semantically similar but not identical
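To see how the threshold trades precision for recall, consider some invented similarity scores for three candidate pairs (`duplicates_at` and the scores are illustrative, not the app's output):

```python
# Invented similarity scores for three candidate pairs.
pairs = {
    ("exact copy", "exact copy"): 1.00,   # identical text
    ("minor rewording", "original"): 0.93, # paraphrase
    ("related topic", "original"): 0.78,   # same theme, different content
}

def duplicates_at(threshold):
    """Return the pairs that would be flagged as duplicates at this threshold."""
    return [pair for pair, sim in pairs.items() if sim >= threshold]
```

At 0.95 only the exact copy is flagged; at the default 0.9 the paraphrase is flagged too; at 0.7 even the merely related pair is removed, which is usually too aggressive.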
### 4. Run Deduplication
Click "Deduplicate" to start the process. You'll see:
- Loading progress for datasets
- Deduplication progress
- Results with statistics and example duplicates
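The reported statistics reduce to a few counts. A hypothetical sketch of how they might be computed (the field names are assumptions, not the app's actual output format):

```python
def dedup_stats(original_count, kept_count):
    """Summarize a deduplication run from before/after row counts."""
    removed = original_count - kept_count
    ratio = removed / original_count if original_count else 0.0
    return {
        "original": original_count,
        "kept": kept_count,
        "removed": removed,
        "duplicate_ratio": ratio,  # fraction of rows removed as duplicates
    }
```

For instance, a 1000-row dataset that keeps 870 rows has a duplicate ratio of 0.13.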
### 5. Push to Hub (New!)
After deduplication completes:
- Log in with your Hugging Face account using the login button
- Enter a dataset name for your cleaned dataset
- Click "Push to Hub" to upload the deduplicated dataset
The dataset will be saved as `your-username/dataset-name` and will be publicly available.
## Notes
- The app preserves all original columns from the datasets
- Only the text similarity is used for deduplication decisions
- Deduplicated datasets maintain the same structure as the original
- OAuth login is required only for pushing to the Hub, not for deduplication