---
language: en
license: mit
tags:
- url-phishing-detection
- tinybert
- sequence-classification
datasets:
- custom
metrics:
- accuracy
- f1
---

# TinyBERT for URL Phishing Detection

This model is fine-tuned from `huawei-noah/TinyBERT_General_4L_312D` to detect phishing URLs.

## Model description

The model is a binary sequence classifier built on TinyBERT: given a URL string, it assigns one of two classes, legitimate or phishing.

## Intended uses & limitations

This model is intended for detecting phishing URLs. It takes a single URL as input and predicts whether that URL is legitimate or phishing. The prediction is based on the URL string alone.

## Training data

The model was trained on a combination of:
- Legitimate URLs from the Majestic Million dataset
- Phishing URLs from phishing-links-ACTIVE.txt and phishing-links-INACTIVE.txt
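The card does not describe how the two sources were merged, so the following is only a minimal sketch of one plausible approach. The function name `build_labeled_urls`, the deduplication rule, and the shuffling seed are all assumptions for illustration; the label convention (1 = phishing, 0 = legitimate) follows the usage example in this card.

```python
import random


def build_labeled_urls(legit_entries, phishing_urls, seed=42):
    """Merge both sources into shuffled (url, label) pairs.

    Labels follow the usage example in this card: 0 = legitimate, 1 = phishing.
    This helper is hypothetical and not taken from the original training code.
    """
    seen = set()
    examples = []
    # Phishing entries go first, so a URL present in both sources
    # keeps the phishing label (an assumed tie-breaking rule).
    labeled = [(u, 1) for u in phishing_urls] + [(u, 0) for u in legit_entries]
    for raw, label in labeled:
        url = raw.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        seen.add(url)
        examples.append((url, label))
    # Shuffle deterministically so the two classes are interleaved.
    random.Random(seed).shuffle(examples)
    return examples
```

In practice the inputs would be the lines read from the Majestic Million export and from the two phishing text files. How those files were parsed is not stated here.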

## Training procedure

The model was fine-tuned using the Hugging Face Transformers library with the following parameters:
- Learning rate: 5e-5
- Batch size: 16
- Number of epochs: 3
- Weight decay: 0.01
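The listed parameters could map onto the Transformers `TrainingArguments` class roughly as follows. This is a sketch, not the original training script: the output directory name is hypothetical, and treating the batch size of 16 as a per-device value is an assumption, since the card does not say which it is.

```python
# Hyperparameters as listed in this card, keyed by TrainingArguments field names.
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,  # assumption: 16 is per device
    "num_train_epochs": 3,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="tinybert-url-out"):
    """Create TrainingArguments from the values above.

    `output_dir` is a hypothetical path chosen for this example.
    """
    # Imported lazily so the config dict can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The resulting object would then be passed to a `Trainer` together with the model and the tokenized datasets.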

## Evaluation results

The model was evaluated on a test set containing both legitimate and phishing URLs, with accuracy and F1 as the metrics.
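For reference, accuracy and F1 for the phishing class can be computed from true and predicted labels as in the sketch below. The function name `accuracy_and_f1` is hypothetical, and the choice of phishing (label 1) as the positive class is an assumption based on the usage example. No scores from the actual evaluation are implied.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the target class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```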

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("songhieng/TinyBERT-URL-Detection-1.0")
model = AutoModelForSequenceClassification.from_pretrained("songhieng/TinyBERT-URL-Detection-1.0")

# Prepare URL for classification
url = "https://example.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=1)
    label = torch.argmax(predictions, dim=1).item()

# Output result
result = "phishing" if label == 1 else "legitimate"
confidence = predictions[0][label].item()
print(f"URL: {url}")
print(f"Prediction: {result}")
print(f"Confidence: {confidence:.4f}")
```