---
library_name: transformers
tags:
- cybersecurity
- mpnet
- classification
- fine-tuned
---

# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier

This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.

## Model Details

### Model Description

This model is a fine-tuned MPNet classifier specialized in categorizing cybersecurity threat groups based on textual descriptions of their tactics, techniques, and procedures (TTPs).

- **Developed by:** Dženan Hamzić
- **Model type:** Transformer-based classification model (MPNet)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning)

### Model Sources

- **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)

## Uses

### Direct Use

This model classifies textual cybersecurity descriptions into known cybersecurity threat groups.

### Downstream Use

Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.

### Out-of-Scope Use

- General language tasks unrelated to cybersecurity
- Tasks outside the cybersecurity domain

## Bias, Risks, and Limitations

This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate.

### Recommendations

Always verify predictions with cybersecurity analysts before using in critical decision-making scenarios.
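One way to act on this recommendation is to route low-confidence predictions to an analyst. The sketch below is a minimal illustration, not part of the released model: the helper name `triage_prediction`, the `0.8` threshold, and the example logits are all assumptions, and softmax scores are not calibrated probabilities, so any threshold would need to be tuned on held-out data.

```python
import torch
import torch.nn.functional as F


def triage_prediction(logits, threshold=0.8):
    """Return (label index, confidence, needs_review) for one example's logits.

    `threshold` is a hypothetical cut-off, not a value from the model card.
    """
    probs = F.softmax(logits, dim=-1)
    confidence, label = torch.max(probs, dim=-1)
    needs_review = confidence.item() < threshold
    return label.item(), confidence.item(), needs_review


# Made-up logits standing in for the output of a sequence classifier
confident_logits = torch.tensor([4.0, 0.5, 0.2])
uncertain_logits = torch.tensor([1.0, 0.9, 0.8])

print(triage_prediction(confident_logits))  # top class is clearly ahead
print(triage_prediction(uncertain_logits))  # classes are nearly tied, flagged for review
```

Predictions flagged with `needs_review=True` would go to a human analyst; the rest should still be spot-checked.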
## How to Get Started with the Model (Classification)

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the mapping from label index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET",
    num_labels=len(label_to_groupid)
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")


def predict_group(sentence):
    """Return the predicted group ID for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    # JSON keys are strings, so convert the label index before the lookup
    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```

Output: `Predicted GroupID: G0001`

## How to Get Started with the Model (Embeddings)

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The label mapping is only needed here to size the classification head
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned classification model and its tokenizer
model_name = "selfconstruct3d/AttackGroup-MPNET"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_to_groupid)
).to(device)


def get_embedding(sentence):
    """Return the first-token (CLS position) embedding of a sentence as a 1-D NumPy array."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        # Run only the MPNet encoder, skipping the classification head
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()

    return cls_embedding


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
embedding = get_embedding(sentence)
print("Embedding shape:", embedding.shape)
print("Embedding values:", embedding)
```

## Training Details

### Training Data

To be announced.

### Training Procedure

- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
- Epochs: 32
- Learning rate: 5e-6
- Batch size: 16

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** Stratified sample from original dataset.
- **Metrics:** Accuracy, Weighted F1 Score

### Results

| Metric                         | Value  |
|--------------------------------|--------|
| Classification Accuracy (Test) | 0.9564 |
| Weighted F1 Score (Test)       | 0.9577 |

## Evaluation Results

| Model                  | Embedding Variability | Accuracy |
|------------------------|-----------------------|----------|
| Original MPNet         | 0.085554              | 0.9964   |
| MLM Fine-tuned MPNet   | 0.034983              | 0.6536   |
| **AttackGroup-MPNET**  | 0.193065              | 0.9508   |
| SecBERT                | 0.591303              | 0.9886   |
| ATTACK-BERT            | 0.096108              | 0.9678   |
| SecureBERT             | 0.007100              | 0.4931   |

### Single Prediction Example

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET"
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")

# Download the mapping from label index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)


def predict_group(sentence):
    """Return the predicted group ID for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** [To be filled by user]
- **Hours used:** [To be filled by user]
- **Cloud Provider:** [To be filled by user]
- **Compute Region:** [To be filled by user]
- **Carbon Emitted:** [To be filled by user]

## Technical Specifications

### Model Architecture

- MPNet architecture with classification head (768 -> 512 -> num_labels)
- Last 10 transformer layers fine-tuned

## Model Card Authors

- Dženan Hamzić

## Model Card Contact

- [More Information Needed]
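
## Appendix: Classification Head Sketch

The head shape listed under Model Architecture (768 -> 512 -> num_labels) and the fine-tuning of only the last 10 transformer layers can be pictured roughly as follows. This is a sketch under stated assumptions, not the released implementation: the class name `ClassificationHead`, the helper `freeze_all_but_last_n`, the ReLU activation, the dropout rate, `num_labels=10`, and the 12-layer stand-in stack are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    """Hypothetical 768 -> 512 -> num_labels head (activation and dropout are assumed)."""

    def __init__(self, hidden_size=768, inner_size=512, num_labels=10, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.ReLU(),            # assumed activation
            nn.Dropout(dropout),  # assumed regularization
            nn.Linear(inner_size, num_labels),
        )

    def forward(self, cls_embedding):
        return self.net(cls_embedding)


def freeze_all_but_last_n(layers, n):
    """Disable gradients for every layer except the last n."""
    cutoff = max(len(layers) - n, 0)
    for index, layer in enumerate(layers):
        trainable = index >= cutoff
        for param in layer.parameters():
            param.requires_grad = trainable


# Small stand-in for a 12-layer encoder stack (not real transformer layers)
encoder_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
freeze_all_but_last_n(encoder_layers, n=10)

# num_labels=10 is a placeholder; the real value comes from label_to_groupid.json
head = ClassificationHead(num_labels=10)
logits = head(torch.randn(2, 768))
print(logits.shape)
```

With a real encoder, the same helper could be applied to the model's list of transformer layers, leaving the first two frozen and the last 10 trainable.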