---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---

# Semantic Text Deduplication Using SemHash

This Gradio application performs semantic deduplication on HuggingFace datasets using SemHash with Model2Vec embeddings.
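Under the hood, the flow is small enough to sketch. The following is a minimal, illustrative example of single-dataset deduplication, assuming the `semhash` package's `from_records`/`self_deduplicate` API; the dataset name and threshold are placeholders, not the app's actual code:

```python
from datasets import load_dataset
from semhash import SemHash

# Example dataset; any text column from the Hub works the same way
texts = load_dataset("ag_news", split="train")["text"]

# SemHash embeds each record with a small Model2Vec model under the hood
semhash = SemHash.from_records(records=texts)

# Drop near-duplicates within the dataset at the given similarity threshold
result = semhash.self_deduplicate(threshold=0.9)
print(f"Kept {len(result.deduplicated)} of {len(texts)} records")
```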

## Features

- **Two deduplication modes:**
  - **Single dataset:** find and remove duplicates within one dataset
  - **Cross-dataset:** remove entries from Dataset 2 that are similar to entries in Dataset 1
- **Customizable similarity threshold:** control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)
- **Detailed results:** view statistics and examples of found duplicates, with word-level differences highlighted (see the `difflib` sketch under step 4)
- **Hub integration:** 🆕 push deduplicated datasets directly to the Hugging Face Hub after logging in

## How to Use

### 1. Choose Deduplication Type

- **Cross-dataset:** useful for removing training-data contamination from test sets (see the sketch below)
- **Single dataset:** clean up duplicate entries within a single dataset
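The cross-dataset mode roughly corresponds to the following sketch, again assuming the `semhash` API; the dataset names are illustrative:

```python
from datasets import load_dataset
from semhash import SemHash

train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Index Dataset 1 (train), then drop any Dataset 2 (test) entry too similar to it
semhash = SemHash.from_records(records=train_texts)
clean_test = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated
```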

### 2. Configure Datasets

- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)
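These three inputs map directly onto a `datasets.load_dataset` call; for example, with the values above:

```python
from datasets import load_dataset

# Dataset name and split exactly as entered in the UI
dataset = load_dataset("SetFit/amazon_massive_scenario_en-US", split="train")

# A wrong text column name fails here with a KeyError
texts = dataset["text"]
```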

### 3. Set Similarity Threshold

- **0.9 (default):** a good balance between precision and recall
- **Higher values (0.95-0.99):** more conservative; only removes very similar texts
- **Lower values (0.7-0.85):** more aggressive; may remove texts that are semantically similar but not actual duplicates
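One way to pick a threshold is to sweep a few values and watch how many records survive. A rough sketch, with the `threshold` keyword assumed from SemHash's deduplication API and an illustrative dataset:

```python
from datasets import load_dataset
from semhash import SemHash

texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts)

# Stricter (higher) thresholds treat fewer pairs as duplicates
for threshold in (0.99, 0.95, 0.9, 0.8):
    kept = len(semhash.self_deduplicate(threshold=threshold).deduplicated)
    print(f"threshold={threshold}: kept {kept} of {len(texts)} records")
```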

### 4. Run Deduplication

Click "Deduplicate" to start the process. You'll see:

  • Loading progress for datasets
  • Deduplication progress
  • Results with statistics and example duplicates
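The word-level highlighting in the results can be approximated with Python's standard `difflib`; this is a sketch of the idea, not the app's exact implementation:

```python
import difflib

def word_diff(a: str, b: str) -> str:
    """Mark word-level differences between two near-duplicate texts."""
    # ndiff prefixes unchanged words with "  ", removals with "- ", additions with "+ "
    tokens = difflib.ndiff(a.split(), b.split())
    return " ".join(t for t in tokens if not t.startswith("? "))

print(word_diff("play some jazz music please", "play some rock music please"))
# "  play   some - jazz + rock   music   please"
```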

### 5. Push to Hub (New!)

After deduplication completes:

1. Log in with your Hugging Face account using the login button
2. Enter a dataset name for your cleaned dataset
3. Click **Push to Hub** to upload the deduplicated dataset

The dataset will be saved as `your-username/dataset-name` and will be publicly available.
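In terms of the `datasets` library, the upload step is essentially the following; the record contents and repository name are placeholders:

```python
from datasets import Dataset

# Deduplicated rows as returned by SemHash (placeholder content)
deduplicated_records = [{"text": "an example row", "label": 3}]

clean = Dataset.from_list(deduplicated_records)
# In the app, the OAuth token from the login button authorizes this upload
clean.push_to_hub("your-username/dataset-name")
```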

## Notes

- The app preserves all original columns from the datasets (see the sketch below)
- Only similarity of the selected text column is used for deduplication decisions
- Deduplicated datasets keep the same structure as the originals
- OAuth login is required only for pushing to the Hub, not for deduplication
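Column preservation works because SemHash can deduplicate full dict records while computing similarity only over named text columns. A sketch, assuming `from_records` accepts a `columns` argument as in SemHash's documented multi-column usage:

```python
from datasets import load_dataset
from semhash import SemHash

# Keep every original column by deduplicating whole rows as dicts
rows = [dict(r) for r in load_dataset("ag_news", split="train")]

# Only the "text" column drives similarity; other columns are carried along
semhash = SemHash.from_records(records=rows, columns=["text"])
clean_rows = semhash.self_deduplicate(threshold=0.9).deduplicated
```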