TinyBERT for URL Phishing Detection

This model is fine-tuned from huawei-noah/TinyBERT_General_4L_312D to detect phishing URLs.

Model description

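The base model, TinyBERT_General_4L_312D, is a distilled BERT variant with 4 transformer layers and a hidden size of 312. This fine-tuned version uses a sequence-classification head to label a URL as either legitimate or phishing.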

Intended uses & limitations

The model is intended for detecting phishing URLs. It takes a URL string as input and predicts whether the URL is legitimate or phishing. The prediction is based on the URL text alone, and the usage example truncates inputs to 128 tokens.

Training data

The model was trained on a combination of:

Legitimate URLs from the Majestic Million dataset
Phishing URLs from phishing-links-ACTIVE.txt and phishing-links-INACTIVE.txt
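
As a rough illustration of how the two sources could be merged into one labeled dataset, the sketch below assigns label 0 to legitimate URLs and label 1 to phishing URLs, matching the mapping in the usage example. The function name `build_labeled_urls`, the deduplication step, and the shuffle seed are assumptions for illustration; the original preprocessing is not documented here.

```python
import random


def build_labeled_urls(legit_urls, phishing_urls, seed=42):
    """Merge two URL collections into shuffled (url, label) pairs.

    Label 0 = legitimate, 1 = phishing (assumed to match the usage example).
    """
    seen = set()
    rows = []
    # Phishing first, so a URL present in both sources keeps the phishing label.
    for label, urls in ((1, phishing_urls), (0, legit_urls)):
        for raw in urls:
            url = raw.strip()
            if not url or url in seen:
                continue  # skip blank lines and duplicates
            seen.add(url)
            rows.append((url, label))
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows


# Hypothetical in-memory sample; in practice the lines would be read from
# the Majestic Million export and the two phishing-links text files.
sample = build_labeled_urls(
    ["https://example.com\n", "https://example.org\n"],
    ["http://login-verify.example.net/account\n", "\n"],
)
print(sample)
```

Both arguments can be any iterable of strings, so open file handles can be passed directly.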

Training procedure

The model was fine-tuned using the Hugging Face Transformers library with the following parameters:

Learning rate: 5e-5
Batch size: 16
Number of epochs: 3
Weight decay: 0.01
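
The hyperparameters above might map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the original training script: mapping "Batch size" to `per_device_train_batch_size` is an assumption, and both the `output_dir` value and the helper name `build_training_args` are hypothetical.

```python
# Hyperparameters as listed in this model card.
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,  # assumption: "Batch size" is per device
    "num_train_epochs": 3,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="tinybert-url-detection"):
    """Build TrainingArguments from HYPERPARAMS (output_dir is hypothetical)."""
    # Imported lazily so the hyperparameter dict is usable without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The returned object would then be passed to a `Trainer` together with the model and the tokenized datasets.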

Evaluation results

The model was evaluated on a test set consisting of both legitimate and phishing URLs.
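
This card does not say which metrics were used. As a sketch of how a binary phishing classifier is commonly scored, the helper below computes accuracy, precision, recall, and F1 with phishing (label 1) as the positive class. The function name `binary_metrics` and the sample labels are made up for illustration and are not results from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only (1 = phishing, 0 = legitimate).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For phishing detection, recall on the phishing class is often watched closely, since a missed phishing URL usually costs more than a false alarm.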

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("songhieng/TinyBERT-URL-Detection-1.0")
model = AutoModelForSequenceClassification.from_pretrained("songhieng/TinyBERT-URL-Detection-1.0")

# Prepare URL for classification
url = "https://example.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=1)
    label = torch.argmax(predictions, dim=1).item()

# Output result
result = "phishing" if label == 1 else "legitimate"
confidence = predictions[0][label].item()
print(f"URL: {url}")
print(f"Prediction: {result}")
print(f"Confidence: {confidence:.4f}")