
AIOne-Vision-30B

A multimodal vision-language model specialized in Korean CCTV video understanding and anomaly detection.


Model Description

AIOne-Vision-30B is a video-text multimodal model purpose-built for Korean CCTV analysis. It understands what is happening in a surveillance clip, classifies it as one of nine event types (including various anomalies), rates its severity, and proposes operator-ready actions, all in fluent Korean.

Unlike a generic video-captioning model, AIOne-Vision is engineered end-to-end for the control-room use case:

  • Compressed for deployment. We start from Qwen2.5-VL-32B-Instruct (64 decoder layers) and apply structured block-level pruning guided by per-layer Block Influence scores, removing the least-informative layers to produce a 30.53B-parameter backbone that fits deployment budgets without giving up the vision-language capabilities of the original model (a sketch of the scoring criterion follows this list).
  • Domain-adapted to Korean surveillance. The pruned backbone is fine-tuned on a curated corpus of real-world Korean CCTV footage spanning urban streets, roads, public spaces, and indoor environments, so scene descriptions sound natural in Korean and align with how human operators actually report what they see.
  • Aligned to operator-grade output. A final preference-optimization stage pushes the model toward consistent anomaly labels, schema-valid JSON output, and reduced hallucination on out-of-distribution clips: the behaviors that matter when the output feeds an alerting pipeline instead of a chat window.
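
The Block Influence criterion itself is cheap to compute from hidden states. Below is a minimal sketch, assuming the definition commonly used in the layer-pruning literature (one minus the mean cosine similarity between a decoder layer's input and output hidden states); the actual pruning code for AIOne-Vision is not published in this card.

import torch

@torch.no_grad()
def block_influence(model, input_ids):
    """Score each decoder layer by how much it changes its hidden states;
    the lowest-scoring (least informative) layers are pruning candidates."""
    hs = model(input_ids, output_hidden_states=True).hidden_states
    scores = []
    for h_in, h_out in zip(hs[:-1], hs[1:]):  # consecutive layer boundaries
        cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
        scores.append(1.0 - cos.mean().item())  # 1 - mean cosine similarity
    return scores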

Key Capabilities

  • Korean CCTV scene understanding: fluent Korean descriptions of people, vehicles, environment, and ongoing actions.
  • Anomaly classification across nine event categories:
    • Fight / violence (fight)
    • Fire / smoke (fire)
    • Traffic accident (traffic_accident)
    • Fall (fall)
    • Safety hazard (safety_hazard)
    • Emergency (emergency)
    • Weather disaster (weather_disaster)
    • Illegal behavior (illegal_behavior)
    • Normal (normal)
  • Severity rating: low / medium / high / critical.
  • Recommended actions: concrete next steps an operator can take immediately.
  • Dual output mode
    • Natural-language mode: human-readable narrative analysis.
    • JSON mode: structured schema output for downstream system integration.
    • The same model switches modes based on a single prompt instruction.

Quick Start

Transformers

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "JDONE-Research/AIOne-Vision-30B"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    # System prompt: "You are a CCTV video analysis expert."
    {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/cctv_clip.mp4", "fps": 1.0},
            # "Analyze this CCTV video and describe the scene."
            {"type": "text", "text": "이 CCTV 영상을 분석하여 장면을 설명하십시오."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

vLLM (recommended for serving)

vllm serve JDONE-Research/AIOne-Vision-30B \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt '{"video": 1}' \
    --max-model-len 16384
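
Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default endpoint on localhost:8000 and a vLLM build that accepts video_url content parts for Qwen2.5-VL-style models (the clip URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JDONE-Research/AIOne-Vision-30B",
    messages=[
        # System prompt: "You are a CCTV video analysis expert."
        {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://example.com/cctv_clip.mp4"}},
                # "Analyze this CCTV video and describe the scene."
                {"type": "text", "text": "이 CCTV 영상을 분석하여 장면을 설명하십시오."},
            ],
        },
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)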

Prompt Guide

Recommended system prompt:

당신은 CCTV 영상 분석 전문가입니다.
("You are a CCTV video analysis expert.")

1) Natural-language mode

이 CCTV 영상을 분석하여 장면을 설명하십시오.
("Analyze this CCTV video and describe the scene.")

2) Structured (JSON) mode

이 CCTV 영상을 분석하여 다음 JSON 스키마로만 응답하십시오.
("Analyze this CCTV video and respond only with the following JSON schema.")
{
  "event_type": "normal|safety_hazard|traffic_accident|fire|fall|fight|emergency|weather_disaster|illegal_behavior",
  "severity_level": "low|medium|high|critical",
  "scene_description": "...ν•©λ‹ˆλ‹€.",
  "recommended_actions": []
}
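
Since JSON mode typically feeds downstream systems, it is worth gating responses before acting on them. A minimal validation sketch, assuming only the field names and enumerations from the prompt above (the helper name is hypothetical):

import json

EVENT_TYPES = {
    "normal", "safety_hazard", "traffic_accident", "fire", "fall",
    "fight", "emergency", "weather_disaster", "illegal_behavior",
}
SEVERITY_LEVELS = {"low", "medium", "high", "critical"}

def parse_analysis(raw: str) -> dict:
    """Parse one JSON-mode response and reject schema violations."""
    result = json.loads(raw)  # raises ValueError on malformed JSON
    assert result["event_type"] in EVENT_TYPES
    assert result["severity_level"] in SEVERITY_LEVELS
    assert isinstance(result["scene_description"], str)
    assert isinstance(result["recommended_actions"], list)
    return result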

Recommended video input settings

Setting                 Recommended value
FPS                     1.0 (1 frame per second)
Max frames              ~16 (works best for 10-30s clips)
Pixel budget / frame    ~200K pixels (around 448x448)
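
With qwen_vl_utils, these settings map onto per-message keys read by process_vision_info. A sketch mirroring the table above (the file path is a placeholder):

video_message = {
    "type": "video",
    "video": "file:///path/to/cctv_clip.mp4",
    "fps": 1.0,             # sample one frame per second
    "max_pixels": 200_000,  # per-frame pixel budget (~448x448)
}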

Sample Output

Natural-language mode

야간 도심 교차로에서 두 명의 남성이 도로 위에 서서 서로 밀치며 다투는 모습이 확인됩니다. 한 명이 상대를 강하게 밀어 쓰러뜨린 뒤 추가 폭력 행위가 이어지고 있어 즉각적인 개입이 필요해 보입니다.

(Translation: At a downtown intersection at night, two men are seen standing in the roadway, shoving each other in a dispute. One man shoves the other to the ground and further violence follows, so immediate intervention appears necessary.)

JSON mode

{
  "event_type": "fight",
  "severity_level": "high",
  "scene_description": "야간 도심 교차로에서 두 명의 남성이 서로를 밀치며 폭력 행위를 보이고 있습니다.",
  "recommended_actions": [
    "관할 지구대에 즉시 신고",
    "주변 CCTV 추적 카메라로 인물 동선 확보",
    "현장 관제사에게 우선 알림 송출"
  ]
}

(Translation: scene_description: "At a downtown intersection at night, two men are shoving each other in an act of violence." recommended_actions: report immediately to the local police substation; secure the subjects' movement path with nearby CCTV tracking cameras; send a priority alert to the on-site control operator.)
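
For illustration only, a hypothetical downstream hook (not part of the model) showing how these fields might drive an alerting pipeline; the tier names and thresholds are assumptions:

def route_event(analysis: dict) -> str:
    """Map a validated JSON-mode response to an alert tier."""
    if analysis["event_type"] == "normal":
        return "log_only"
    if analysis["severity_level"] in {"high", "critical"}:
        return "page_operator"    # surface immediately, with recommended actions
    return "queue_for_review"     # low/medium anomalies go to a review queue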

Model Specs

Field             Value
Architecture      Qwen2_5_VLForConditionalGeneration
Base model        Qwen/Qwen2.5-VL-32B-Instruct
Parameters        ~30.53B (pruned from 32B; 6 decoder layers removed)
Decoder layers    58 (original: 64)
Modality          Video + Image + Text → Text
Precision         bfloat16
Context length    128K tokens
Languages         Korean (primary), English

Intended Use

  • CCTV control-room assistance (surfacing anomaly-event candidates).
  • Natural-language search and summarization over video archives.
  • Event labeling for security and safety dashboards.
  • Research on multimodal video understanding.

Out-of-Scope Use

  • Sole-source decision-making with legal consequences (identification, arrest, sanctions, etc.).
  • Automated use of force or coercive control based purely on this model's output.
  • Any video analysis that infringes on personal privacy, image rights, or applicable data-protection laws.

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

  • Non-commercial use, redistribution, and modification are permitted with attribution.
  • Commercial use requires a separate agreement.
  • Users must also comply with the license terms of the base model, Qwen2.5-VL-32B-Instruct.

Citation

@misc{aione_vision,
  title  = {AIOne-Vision-30B: A Korean CCTV Video Understanding VLM},
  author = {JDONE Research},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Vision-30B}}
}