AIOne-Vision-30B
A multimodal vision-language model specialized in Korean CCTV video understanding and anomaly detection.
Model Description
AIOne-Vision-30B is a video-to-text multimodal model purpose-built for Korean CCTV analysis. It understands what is happening in a surveillance clip, classifies it as one of nine event types (including various anomalies), rates its severity, and proposes operator-ready actions, all in fluent Korean.
Unlike a generic video-captioning model, AIOne-Vision is engineered end-to-end for the control-room use case:
- Compressed for deployment. We start from Qwen2.5-VL-32B-Instruct (64 decoder layers) and apply structured block-level pruning guided by per-layer Block Influence scores, removing the least-informative layers to produce a 30.53B-parameter backbone that fits deployment budgets without giving up the vision-language capabilities of the original model.
- Domain-adapted to Korean surveillance. The pruned backbone is fine-tuned on a curated corpus of real-world Korean CCTV footage spanning urban streets, roads, public spaces, and indoor environments, so scene descriptions sound natural in Korean and align with how human operators actually report what they see.
- Aligned to operator-grade output. A final preference-optimization stage pushes the model toward consistent anomaly labels, schema-valid JSON output, and reduced hallucination on out-of-distribution clips: the behaviors that matter when the output feeds an alerting pipeline instead of a chat window.
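The pruning criterion above can be sketched in a few lines. A minimal illustration, assuming Block Influence is computed as one minus the mean cosine similarity between the hidden states entering and leaving each decoder layer (the exact scoring and calibration data used for this model are not detailed here):

```python
import numpy as np

def block_influence(hidden_in: np.ndarray, hidden_out: np.ndarray) -> float:
    """Block Influence of one decoder layer: 1 - mean cosine similarity
    between the hidden states entering and leaving the layer.
    A low score means the layer barely transforms its input."""
    num = (hidden_in * hidden_out).sum(axis=-1)
    den = np.linalg.norm(hidden_in, axis=-1) * np.linalg.norm(hidden_out, axis=-1)
    return float(1.0 - (num / den).mean())

def layers_to_prune(scores: list, n_remove: int) -> list:
    """Indices of the n_remove least-influential layers."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n_remove]

# Toy example: 8 layers, remove the 2 with the lowest influence.
scores = [0.30, 0.02, 0.25, 0.01, 0.40, 0.22, 0.35, 0.28]
print(layers_to_prune(scores, 2))  # -> [3, 1]
```

For the released model, 6 of the original 64 decoder layers were removed this way (see Model Specs below).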
Key Capabilities
- Korean CCTV scene understanding: fluent Korean descriptions of people, vehicles, environment, and ongoing actions.
- Anomaly classification across nine event categories:
  - Fight / violence (`fight`)
  - Fire / smoke (`fire`)
  - Traffic accident (`traffic_accident`)
  - Fall (`fall`)
  - Safety hazard (`safety_hazard`)
  - Emergency (`emergency`)
  - Weather disaster (`weather_disaster`)
  - Illegal behavior (`illegal_behavior`)
  - Normal (`normal`)
- Severity rating: `low` / `medium` / `high` / `critical`.
- Recommended actions: actionable suggestions an operator can act on immediately.
- Dual output mode
- Natural-language mode: human-readable narrative analysis.
- JSON mode: structured schema output for downstream system integration.
- The same model switches modes based on a single prompt instruction.
Quick Start
Transformers
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "JDONE-Research/AIOne-Vision-30B"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    # "You are a CCTV video analysis expert."
    {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/cctv_clip.mp4", "fps": 1.0},
            # "Analyze this CCTV video and describe the scene."
            {"type": "text", "text": "이 CCTV 영상을 분석하여 장면을 설명하십시오."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
vLLM (recommended for serving)
```shell
vllm serve JDONE-Research/AIOne-Vision-30B \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --limit-mm-per-prompt '{"video": 1}' \
  --max-model-len 16384
```
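Once the server is up, requests go through vLLM's OpenAI-compatible `/v1/chat/completions` route. A sketch of the request body, assuming vLLM's `video_url` multimodal content type and the default port 8000 (the video path and helper function are placeholders, not part of this repo):

```python
import json

def chat_payload(video_url: str, prompt: str) -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/chat/completions route.
    vLLM accepts video inputs via the "video_url" multimodal content type."""
    return {
        "model": "JDONE-Research/AIOne-Vision-30B",
        "messages": [
            # "You are a CCTV video analysis expert."
            {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        "max_tokens": 512,
    }

payload = chat_payload("file:///path/to/cctv_clip.mp4",
                       "이 CCTV 영상을 분석하여 장면을 설명하십시오.")
# POST this body to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, ensure_ascii=False, indent=2)[:80])
```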
Prompt Guide
Recommended system prompt:
당신은 CCTV 영상 분석 전문가입니다.
("You are a CCTV video analysis expert.")
1) Natural-language mode
이 CCTV 영상을 분석하여 장면을 설명하십시오.
("Analyze this CCTV video and describe the scene.")
2) Structured (JSON) mode
이 CCTV 영상을 분석하여 다음 JSON 스키마로만 응답하십시오.
("Analyze this CCTV video and respond only with the following JSON schema.")
```json
{
  "event_type": "normal|safety_hazard|traffic_accident|fire|fall|fight|emergency|weather_disaster|illegal_behavior",
  "severity_level": "low|medium|high|critical",
  "scene_description": "...합니다.",
  "recommended_actions": []
}
```
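Since JSON-mode output typically feeds an automated pipeline, it is worth validating it before acting on it. A minimal guard, assuming the schema above (the field names come from this card; the function and error handling are illustrative):

```python
import json

ALLOWED_EVENTS = {"normal", "safety_hazard", "traffic_accident", "fire",
                  "fall", "fight", "emergency", "weather_disaster",
                  "illegal_behavior"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

def validate_analysis(raw: str) -> dict:
    """Parse the model's JSON-mode output and check it against the schema;
    raises ValueError on any violation."""
    data = json.loads(raw)
    if data.get("event_type") not in ALLOWED_EVENTS:
        raise ValueError(f"bad event_type: {data.get('event_type')!r}")
    if data.get("severity_level") not in ALLOWED_SEVERITIES:
        raise ValueError(f"bad severity_level: {data.get('severity_level')!r}")
    if not isinstance(data.get("scene_description"), str):
        raise ValueError("scene_description must be a string")
    if not isinstance(data.get("recommended_actions"), list):
        raise ValueError("recommended_actions must be a list")
    return data

ok = validate_analysis('{"event_type": "fight", "severity_level": "high", '
                       '"scene_description": "...", "recommended_actions": []}')
print(ok["event_type"])  # -> fight
```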
Recommended video input settings
| Setting | Recommended value |
|---|---|
| FPS | 1.0 (1 frame per second) |
| Max frames | ~`16` (works best for 10–30 s clips) |
| Pixel budget / frame | ~200K pixels (around 448×448) |
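These settings translate into a concrete per-clip sampling plan. An illustrative helper (the function and its defaults are our own, derived from the table above, not an API of this repo):

```python
import math

def sampling_plan(duration_s: float, width: int, height: int,
                  fps: float = 1.0, max_frames: int = 16,
                  pixel_budget: int = 200_000):
    """Turn the recommended settings into concrete numbers:
    how many frames to sample and what size to resize them to."""
    n_frames = min(max_frames, max(1, int(duration_s * fps)))
    # uniform downscale so width*height stays within the pixel budget
    scale = min(1.0, math.sqrt(pixel_budget / (width * height)))
    new_w, new_h = int(width * scale), int(height * scale)
    return n_frames, (new_w, new_h)

# A 25 s 1080p clip: capped at 16 frames, downscaled under ~200K pixels/frame.
print(sampling_plan(25.0, 1920, 1080))  # -> (16, (596, 335))
```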
Sample Output
Natural-language mode
야간 도심 교차로에서 두 명의 남성이 도로 위에 서서 서로 밀치며 다투는 모습이 확인됩니다. 한 명이 상대를 강하게 밀어 쓰러뜨린 뒤 추가 폭력 행위가 이어지고 있어 즉각적인 개입이 필요해 보입니다.

(Translation: "At a downtown intersection at night, two men are seen standing on the road, shoving and fighting each other. One has forcefully pushed the other to the ground and further violence is following, so immediate intervention appears necessary.")
JSON mode
```json
{
  "event_type": "fight",
  "severity_level": "high",
  "scene_description": "야간 도심 교차로에서 두 명의 남성이 서로를 밀치며 폭력 행위를 보이고 있습니다.",
  "recommended_actions": [
    "관할 지구대에 즉시 신고",
    "주변 CCTV 추적 카메라로 인물 동선 확보",
    "현장 관제사에게 우선 알림 송출"
  ]
}
```

(Translation: the scene description reads "Two men at a downtown intersection at night are shoving each other in a violent altercation"; the actions are: report immediately to the local police substation, secure the subjects' movements via nearby CCTV tracking cameras, send a priority alert to the on-site controller.)
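As one example of downstream integration, a hypothetical router that turns JSON-mode events into alert decisions (the function, channels, and thresholds are illustrative, not part of the model or this repo):

```python
from typing import Optional

# Severity order matching the model's severity_level field.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def route_alert(event: dict, page_at: str = "high") -> Optional[str]:
    """Return "page" for anomalies at or above the paging threshold,
    "log" for other anomalies, and None for normal footage."""
    if event["event_type"] == "normal":
        return None
    if SEVERITY_RANK[event["severity_level"]] >= SEVERITY_RANK[page_at]:
        return "page"
    return "log"

print(route_alert({"event_type": "fight", "severity_level": "high"}))  # -> page
print(route_alert({"event_type": "normal", "severity_level": "low"}))  # -> None
```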
Model Specs
| Field | Value |
|---|---|
| Architecture | Qwen2_5_VLForConditionalGeneration |
| Base model | Qwen/Qwen2.5-VL-32B-Instruct |
| Parameters | ~30.53B (pruned from 32B, 6 decoder layers removed) |
| Decoder layers | 58 (original: 64) |
| Modality | Video + Image + Text → Text |
| Precision | bfloat16 |
| Context length | 128K |
| Languages | Korean (primary), English |
Intended Use
- CCTV control-room assistance (surfacing anomaly-event candidates).
- Natural-language search and summarization over video archives.
- Event labeling for security and safety dashboards.
- Research on multimodal video understanding.
Out-of-Scope Use
- Sole-source decision-making with legal consequences (identification, arrest, sanctions, etc.).
- Automated use of force or coercive control based purely on this model's output.
- Any video analysis that infringes on personal privacy, image rights, or applicable data-protection laws.
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
- Non-commercial use, redistribution, and modification are permitted with attribution.
- Commercial use requires a separate agreement.
- Users must also comply with the license terms of the base model, Qwen2.5-VL-32B-Instruct.
Citation
```bibtex
@misc{aione_vision,
  title = {AIOne-Vision-30B: A Korean CCTV Video Understanding VLM},
  author = {JDONE Research},
  year = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Vision-30B}}
}
```