README.md · null822/webshell-detect-bert at main

	---
	language:
	- en
	- zh
	library_name: transformers
	tags:
	- security
	- webshell-detection
	- malware-detection
	- cybersecurity
	- code-classification
	- php
	- asp
	- jsp
	- python
	- perl
	license: mit
	datasets:
	- null822/webshell-sample
	base_model:
	- microsoft/codebert-base
	- huawei-noah/TinyBERT_General_4L_312D
	pipeline_tag: text-classification
	widget:
	- text: "<?php eval($_POST['cmd']); ?>"
	  example_title: "Malicious WebShell Example"
	- text: "<?php echo 'Hello World'; ?>"
	  example_title: "Normal PHP Code"
	---

	# WebShell Detection Models Collection

	## Model Overview

	This is a collection of machine learning models for detecting malicious WebShell code, fine-tuned from BERT-family base models (CodeBERT and TinyBERT). The repository contains four model variants, each optimized for a different use case.

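	## Model Variants

	### 1. full_codebert_model
	- Base model: microsoft/codebert-base
	- Training data: multi-language dataset (PHP, ASP, JSP, Python, Perl, HTML, JavaScript, Shell, and others)
	- Parameters: ~125M
	- Notes: high accuracy; suited to scenarios where accuracy matters most

	### 2. full_tinybert_model
	- Base model: huawei-noah/TinyBERT_General_4L_312D
	- Training data: multi-language dataset
	- Parameters: ~14.5M
	- Notes: lightweight with fast inference; suited to resource-constrained environments

	### 3. php_codebert_model
	- Base model: microsoft/codebert-base
	- Training data: PHP-only code dataset
	- Parameters: ~125M
	- Notes: optimized specifically for PHP WebShell detection

	### 4. php_tinybert_model
	- Base model: huawei-noah/TinyBERT_General_4L_312D
	- Training data: PHP-only code dataset
	- Parameters: ~14.5M
	- Notes: lightweight PHP-specific model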

	## Supported File Types

	- PHP (.php)
	- ASP (.asp, .aspx)
	- JSP (.jsp, .jspx)
	- Python (.py)
	- Perl (.pl)
	- HTML (.html, .htm)
	- JavaScript (.js)
	- Shell scripts (.sh)
	- CGI (.cgi)
	- Java (.java)
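As a rough illustration of how the extension list above might be used, the sketch below walks a directory and keeps only files with a supported extension. The names `SUPPORTED_EXTENSIONS` and `find_candidate_files` are hypothetical helpers introduced here, not part of this repository.

```python
from pathlib import Path

# Extensions copied from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".php", ".asp", ".aspx", ".jsp", ".jspx", ".py", ".pl",
    ".html", ".htm", ".js", ".sh", ".cgi", ".java",
}


def find_candidate_files(root):
    """Return files under `root` whose extension is in the supported list."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare in lowercase so that "SHELL.PHP" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path could then be passed to a per-file detection function.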

	## Usage

	### Basic Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Choose a model variant
	model_name = "null822/webshell-detect-bert"
	subfolder = "full_tinybert_model"  # or any other variant

	# Load the tokenizer and model from the chosen subfolder
	tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder=subfolder)
	model = AutoModelForSequenceClassification.from_pretrained(model_name, subfolder=subfolder)

	def detect_webshell(code_text):
	    # Inputs longer than 512 tokens are truncated
	    inputs = tokenizer(code_text, return_tensors="pt", truncation=True, max_length=512)
	    with torch.no_grad():
	        outputs = model(**inputs)
	    # Class index 1 is the malicious label
	    prediction = torch.argmax(outputs.logits, dim=1).item()
	    return "Malicious WebShell" if prediction == 1 else "Normal Code"

	# Example
	code = "<?php eval($_POST['cmd']); ?>"
	result = detect_webshell(code)
	print(result)  # expected output: Malicious WebShell
	```
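The basic usage returns only the argmax label. Where a confidence score or a stricter decision threshold is wanted, something like the following sketch could be applied to the two logits. It assumes, as in the usage code, that class index 1 is the malicious label. `softmax` and `label_with_threshold` are hypothetical helpers, and the default threshold is illustrative rather than tuned.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_threshold(logits, threshold=0.5):
    """Return (label, malicious_probability) for one sample's logits.

    Assumes index 1 is the malicious class, as in the basic usage example.
    """
    p_malicious = softmax(logits)[1]
    label = "Malicious WebShell" if p_malicious >= threshold else "Normal Code"
    return label, p_malicious
```

With a loaded model, the logits for a single input could be obtained with `outputs.logits[0].tolist()` and passed in directly. Raising `threshold` trades recall for fewer false positives.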

	### Batch Detection

	```python
	def batch_detect(code_list):
	    # Classify each snippet in turn (one forward pass per snippet)
	    return [detect_webshell(code) for code in code_list]

	# Example
	codes = [
	    "<?php echo 'Hello World'; ?>",
	    "<?php eval($_POST['cmd']); ?>",
	    "<?php system($_GET['c']); ?>",
	]
	results = batch_detect(codes)
	```

	### File Detection

	```python
	def detect_file(file_path):
	    try:
	        # Undecodable bytes are dropped so binary-looking files do not crash the read
	        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
	            content = f.read()
	    except OSError as e:
	        return f"Error reading file: {e}"
	    return detect_webshell(content)

	# Example
	result = detect_file("suspicious_file.php")
	```

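	## Model Selection Guide

	| Use case | Recommended model | Reason |
	|---------|---------|------|
	| Production, high accuracy required | `full_codebert_model` | Highest accuracy |
	| Resource-constrained, fast response needed | `full_tinybert_model` | Balances performance and resource usage |
	| Dedicated PHP WebShell detection | `php_codebert_model` | PHP-optimized, high accuracy |
	| PHP detection, resource-constrained | `php_tinybert_model` | PHP-specific and lightweight |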
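The selection table can be reduced to two yes/no questions: is the input PHP only, and are resources constrained? The sketch below encodes that mapping. `choose_subfolder` is a hypothetical helper, not something shipped with the repository.

```python
def choose_subfolder(php_only: bool, low_resource: bool) -> str:
    """Map the two criteria from the selection guide to a model subfolder."""
    scope = "php" if php_only else "full"
    backbone = "tinybert" if low_resource else "codebert"
    return f"{scope}_{backbone}_model"
```

The returned string can be passed as the `subfolder` argument when loading the tokenizer and model.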

	## Performance Metrics

	Model performance on the test set:

	- Accuracy: >95%
	- Precision: >94%
	- Recall: >96%
	- F1-Score: >95%

	Exact figures may vary with the test dataset.
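To check these figures on your own data, the four metrics can be computed from true and predicted labels. The following is a minimal sketch that assumes labels are encoded as 1 for malicious and 0 for normal. `binary_metrics` is a hypothetical helper, and the card does not describe the author's actual evaluation script.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For WebShell detection, recall on the malicious class is usually the figure to watch, since a missed WebShell costs more than a false alarm.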

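	## Training Data

	- Dataset: [null822/webshell-sample](https://huggingface.co/datasets/null822/webshell-sample)
	- Sample count: 5000+ code samples
	- Data sources:
	  - Benign code: open-source projects and legitimate code repositories
	  - Malicious code: known WebShell samples and malicious scripts
	- Data processing: samples are Base64-encoded for safe transfer and storage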
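Because the samples are stored Base64-encoded, they need decoding before being passed to a tokenizer. The sketch below assumes each sample is a standard Base64 string of UTF-8 source text. The card does not document the dataset's column names or exact layout, so `decode_sample` and `encode_sample` are hypothetical helpers.

```python
import base64


def encode_sample(code: str) -> str:
    """Encode source text as a Base64 ASCII string (assumed storage format)."""
    return base64.b64encode(code.encode("utf-8")).decode("ascii")


def decode_sample(encoded: str) -> str:
    """Decode a Base64 sample back to source text.

    Undecodable bytes are dropped, mirroring errors='ignore' in the
    file detection example.
    """
    raw = base64.b64decode(encoded)
    return raw.decode("utf-8", errors="ignore")
```

Storing samples this way keeps live malicious payloads from being flagged or executed by accident while the dataset is in transit or at rest.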

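	## Limitations

	1. Context length: inputs are limited to 512 tokens; longer inputs are truncated
	2. Language support: primarily targets English-language code and common programming languages
	3. False positives: complex benign code may be misclassified as malicious
	4. Retraining: the models need periodic retraining on new threat samples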
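One common way to work around the 512-token limit is to score overlapping windows of a long file and keep the highest malicious score, so that a payload beyond the first 512 tokens is not silently truncated away. The sketch below shows only the windowing and aggregation logic. `score_window` is a hypothetical callable that returns a malicious probability for a list of token ids, the default window of 510 assumes two special tokens are added per window, and none of this is the author's documented method.

```python
def window_spans(n_tokens, max_len=510, stride=255):
    """Return (start, end) index pairs that cover n_tokens with overlap."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the input
        start += stride
    return spans


def score_long_input(token_ids, score_window, max_len=510, stride=255):
    """Score every window and return the maximum malicious probability."""
    spans = window_spans(len(token_ids), max_len, stride)
    # An empty input has no windows; treat it as not malicious.
    return max((score_window(token_ids[s:e]) for s, e in spans), default=0.0)
```

Taking the maximum is a conservative choice: a single suspicious window is enough to flag the file, which may raise the false-positive rate on long benign files.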

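	## Deployment Recommendations

	1. Production: use `full_codebert_model` for the best accuracy
	2. Edge devices: use a TinyBERT variant to reduce resource usage
	3. Real-time detection: consider batching inputs to improve throughput
	4. Security integration: combine with other security tools; do not rely on these models as the only line of defense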
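The batching recommendation can be sketched as follows: inputs are grouped into fixed-size batches, and each batch goes through the model in a single padded forward pass. `batched`, `detect_in_batches` and `make_hf_classifier` are hypothetical helpers, and the batch size of 16 is an arbitrary starting point rather than a measured optimum.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def detect_in_batches(codes, classify_batch, batch_size=16):
    """Run `classify_batch` on each batch and collect predicted class ids in order."""
    predictions = []
    for batch in batched(codes, batch_size):
        predictions.extend(classify_batch(batch))
    return predictions


def make_hf_classifier(tokenizer, model):
    """Wrap a transformers tokenizer/model pair as a `classify_batch` callable."""
    import torch  # imported lazily so the helpers above work without torch

    def classify_batch(batch):
        # Pad to the longest snippet in the batch; truncate at 512 tokens.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.argmax(logits, dim=1).tolist()

    return classify_batch
```

With a loaded tokenizer and model, `detect_in_batches(codes, make_hf_classifier(tokenizer, model))` would return one class id per snippet. Sorting inputs by length before batching can further reduce wasted padding.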

	## Citation

	If you use these models, please cite:

	```bibtex
	@misc{webshell-detect-bert,
	  title={WebShell Detection Models based on BERT},
	  author={null822},
	  year={2025},
	  publisher={Hugging Face},
	  howpublished={\url{https://huggingface.co/null822/webshell-detect-bert}}
	}
	```

	## License

	MIT License

	## Contact

	For questions or suggestions, please get in touch through GitHub Issues.