This model has been fine-tuned to perform a single task well: generating hunting rules from malware or threat actor TTP descriptions. As such, you need to construct your prompts as follows to get the expected result:

`Generate a [YARA/SIGMA] rule for the following:\n [malware description]`
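For example, a minimal inference sketch with Hugging Face transformers; the truncated description and generation settings are illustrative, and it assumes this repository contains merged fp16 weights (if it ships only the LoRA adapter, load it with PEFT instead, as sketched after the config below):

```python
# Minimal inference sketch, assuming merged fp16 weights in this repo;
# prompt wording follows the template above, generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ctitools/neurocti-qwen3-32b-orion10k-instruct-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

description = "Exfiltrated browser login data is written to C:\\ProgramData\\file.txt ..."
messages = [{"role": "user",
             "content": f"Generate a YARA rule for the following:\n {description}"}]

# The qwen3 chat template inserts the <|im_start|>/<|im_end|> markers shown below.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```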
Here is an example output, including the thinking tokens:
========================================
<|im_start|>user
Generate a YARA rule for the following:
Exfiltrated browser login data and informational messages are written to `C:\ProgramData\file.txt`
and copied to `%LOCALAPPDATA%\Packages\Bay0NsQIzx\p.txt` when stealing operations have
completed. If found, targeted browser extensions and wallets have their directories copied to
`%LOCALAPPDATA%\Packages\Bay0NsQIzx`, and a Telegram message is sent indicating the number of
crypto wallets found. The contents of `%LOCALAPPDATA%\Packages\Bay0NsQIzx` are subsequently
compressed into an archive named `output.zip`, located in the same directory. The archive is then
exfiltrated to the Telegram bot and a secondary C2 endpoint hosted at *wetransfers[.]io/uplo.php*, as
shown in **Figure 7**. The domain `wetransfers[.]io` was registered on February 18, 2025, via
NameCheap, Inc., and is currently hosted behind Cloudflare infrastructure.<|im_end|>
<|im_start|>assistant
<think>
Okay, I need to create a YARA rule based on the provided information about malware file paths and strings. Let me start by carefully reading through the details given.
The user mentioned specific files used for storing stolen data:
- Original log file: C:\ProgramData\file.txt
- Copied log file: %LOCALAPPDATA%\Packages\Bay0NsQIzx\p.txt
Then there's an output zip file created from the Bay0NsQIzx folder:
- Archive path: %LOCALAPPDATA%\Packages\Bay0NsQIzx\output.zip (but stored in the same dir? Or maybe destination?)
Also mentioned exfiltration URLs:
- http://wetransfers.io/uplo.php (suspicious domain name similar to WeTransfer)
Strings related to stealer activities might include phrases like "Chrome Extension", "MetaMask", "Telegram" commands or methods. Also look for hardcoded domains/paths like 'uplo.php', the domain itself.
For the YARA rule:
Metadata:
Title: Stealer Output File Names / Paths
Author: cti.tools
Description: Detects characteristic filenames and paths used by a stealer, including its output ZIP and upload URL string.
Reference: AI generated rule from the ORKL10K dataset
Condition:
Check if any of these strings appear in the file, likely in code sections related to file saving or network requests.
Need to add the exact file names and paths where possible. For example:
- Check for the creation of p.txt in that Packages directory
- Detection of output.zip being created in that location
- Presence of the uplo.php URL string
Potential rules could focus on:
- Specific file names and paths used for temporary storage/output
- Hardcoded C2/Upload server URL strings
- Strings indicative of targeting wallets/extensions (less reliable alone but useful with other indicators)
So looking at this info, I'll build rules focused on detecting the unique file paths or the domain string used for uploading.
</think>
rule MetaStealer_FilePaths_Indicators {
    meta:
        author = "cti.tools"
        description = "Detects MetaStealer based on unique file paths or upload URL."
        reference = "AI generated rule from the ORKL10K dataset"
    strings:
        $fp_original_log = "C:\\ProgramData\\file.txt" ascii wide
        $fp_copied_log = "%LOCALAPPDATA%\\Packages\\Bay0NsQIzx\\p.txt" ascii wide
        $fp_output_zip = "%LOCALAPPDATA%\\Packages\\Bay0NsQIzx\\output.zip" ascii wide
        $upload_url = "http://wetransfers.io/uplo.php" ascii wide
    condition:
        uint16(0) == 0x5A4D and filesize < 2MB and ($fp_original_log or $fp_copied_log or $fp_output_zip or $upload_url)
}<|im_end|>
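Before deploying a generated rule, it is worth checking that it actually compiles; a minimal sketch using the yara-python package (the file names are illustrative):

```python
# Minimal sketch: sanity-check that a generated rule compiles and scans,
# using yara-python (pip install yara-python); "sample.bin" is illustrative.
import yara

generated_rule = open("generated_rule.yar").read()
rules = yara.compile(source=generated_rule)  # raises yara.SyntaxError on malformed output
matches = rules.match("sample.bin")          # scan a file against the compiled rule
print(matches)
```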
Axolotl config used for training:
axolotl version: 0.10.0.dev0
base_model: Qwen/Qwen3-32B
# Automatically upload checkpoint and final model to HF
hub_model_id: ctitools/neurocti-qwen3-32b-orion10k-instruct-fp16
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false
chat_template: qwen3
datasets:
  - path: ctitools/orion_10k
    type: chat_template
    field_messages: messages
    split: train[:1%]
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant
val_set_size: 0.01
output_dir: ./outputs/out
dataset_prepared_path: last_run_prepared
sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
#load_in_4bit: false
#load_in_8bit: true
adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true
bf16: auto
tf32: true
wandb_project: neurocti-hunting
wandb_entity: aaronkaplan
wandb_watch:
wandb_name: neurocti-hunting_lora_neurocti-qwen3-32b-orion10k-instruct-fb16-r32-lr0.0001-sl8192-e3-v1
wandb_log_model:
gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 3
#optimizer: adamw_torch_4bit
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0001
gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
# multi-gpu setups:
deepspeed: deepspeed_configs/zero2.json
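The config saves a LoRA adapter (adapter: lora, lora_r: 32) to ./outputs/out. A minimal sketch for loading that adapter with PEFT, assuming the adapter was not merged into the base weights:

```python
# Minimal sketch: load the LoRA adapter produced by the axolotl run above.
# Assumes ./outputs/out contains the saved adapter files (adapter_config.json, ...).
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("./outputs/out", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")  # base model from the config
```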
neurocti-qwen3-32b-orion10k-instruct-fp16
This model is a fine-tuned version of Qwen/Qwen3-32B on the ctitools/orion_10k dataset. It achieves the following results on the evaluation set:
- Loss: 0.4578
Model description
The NeuroCTI Cyber Threat Intelligence (CTI) model was fine-tuned on the orion10k dataset. Its purpose is to help automatically generate YARA detection rules from unstructured CTI reports.
Intended uses & limitations
This model is experimental and intended for evaluation use only for now.
Training and evaluation data
The model was trained on the ctitools/orion_10k dataset (train[:1%] split), with 1% of that held out for evaluation; see the axolotl config above.
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- total_eval_batch_size: 4
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3.0
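(For reference, the total train batch size follows from micro_batch_size × gradient_accumulation_steps × num_devices = 1 × 2 × 4 = 8.)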
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.6419 | 0.1667 | 1 | 6.5215 |
| 1.8202 | 0.3333 | 2 | 6.5088 |
| 1.6672 | 0.6667 | 4 | 6.2647 |
| 1.6093 | 1.0 | 6 | 5.1988 |
| 1.5079 | 1.3333 | 8 | 3.3091 |
| 1.1391 | 1.6667 | 10 | 1.6507 |
| 0.9673 | 2.0 | 12 | 1.0461 |
| 0.9455 | 2.3333 | 14 | 0.6177 |
| 0.7946 | 2.6667 | 16 | 0.4784 |
| 0.7557 | 3.0 | 18 | 0.4578 |
Framework versions
- PEFT 0.15.2
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1