---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repos
  - manage-repos
---

# Semantic Text Deduplication Using SemHash

This Gradio application performs **semantic deduplication** on Hugging Face datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.

## Features

- **Two deduplication modes**:
  - **Single dataset**: Find and remove duplicates within one dataset
  - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1

- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)

- **Detailed results**: View statistics and examples of found duplicates with word-level differences highlighted

- **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in

## How to Use

### 1. Choose Deduplication Type
- **Cross-dataset**: Useful for removing training data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset

### 2. Configure Datasets
- Enter the Hugging Face dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)

### 3. Set Similarity Threshold
- **0.9** (default): Good balance between precision and recall
- **Higher values** (0.95-0.99): More conservative, only removes very similar texts
- **Lower values** (0.7-0.85): More aggressive, may remove semantically similar but different texts
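
SemHash compares Model2Vec embedding similarity, but the effect of the threshold can be illustrated with a much simpler, self-contained similarity measure. The word-level Jaccard score below is purely a stand-in for illustration, not what SemHash actually computes:

```python
# Illustration of how a similarity threshold changes dedup behavior.
# Word-level Jaccard similarity stands in for embedding similarity.

def jaccard(a: str, b: str) -> float:
    """Similarity in [0, 1] based on shared words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def dedup(texts: list[str], threshold: float) -> list[str]:
    """Keep a text only if it is below the threshold against all kept texts."""
    kept: list[str] = []
    for t in texts:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

texts = [
    "book a table for two",
    "book a table for two please",
    "play some jazz music",
]
print(dedup(texts, threshold=0.95))  # conservative: keeps all three
print(dedup(texts, threshold=0.7))   # aggressive: drops the near-duplicate
```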

### 4. Run Deduplication
Click **"Deduplicate"** to start the process. You'll see:
- Loading progress for datasets
- Deduplication progress
- Results with statistics and example duplicates

### 5. Push to Hub (New!)
After deduplication completes:
1. **Log in** with your Hugging Face account using the login button
2. Enter a **dataset name** for your cleaned dataset
3. Click **"Push to Hub"** to upload the deduplicated dataset

The dataset will be saved as `your-username/dataset-name` and will be publicly available.


## Notes

- The app preserves all original columns from the datasets
- Only text similarity is used for deduplication decisions; other columns are ignored
- Deduplicated datasets maintain the same structure as the original
- OAuth login is required only for pushing to the Hub, not for deduplication