
AIOne-Vision-30B

A multimodal vision-language model specialized in Korean CCTV video understanding and anomaly detection.


Model Description

AIOne-Vision-30B is a video-text multimodal model purpose-built for Korean CCTV analysis. It understands what is happening in a surveillance clip, classifies it as one of nine event types (including various anomalies), rates its severity, and proposes operator-ready actions, all in fluent Korean.

Unlike a generic video-captioning model, AIOne-Vision is engineered end-to-end for the control-room use case:

  • Compressed for deployment. We start from Qwen2.5-VL-32B-Instruct (64 decoder layers) and apply structured block-level pruning guided by per-layer Block Influence scores, removing the least-informative layers to produce a 30.53B-parameter backbone that fits deployment budgets without giving up the vision-language capabilities of the original model (a sketch of the scoring criterion follows this list).
  • Domain-adapted to Korean surveillance. The pruned backbone is fine-tuned on a curated corpus of real-world Korean CCTV footage spanning urban streets, roads, public spaces, and indoor environments, so scene descriptions sound natural in Korean and align with how human operators actually report what they see.
  • Aligned to operator-grade output. A final preference-optimization stage pushes the model toward consistent anomaly labels, schema-valid JSON output, and reduced hallucination on out-of-distribution clips: the behaviors that matter when the output feeds an alerting pipeline instead of a chat window.
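
The Block Influence criterion itself is cheap to compute from hidden states. Below is a minimal sketch, assuming the definition commonly used in the layer-pruning literature (one minus the mean cosine similarity between a decoder layer's input and output hidden states); the actual pruning code for AIOne-Vision is not published in this card.

import torch

@torch.no_grad()
def block_influence(model, input_ids):
    """Score each decoder layer by how much it changes its hidden states;
    the lowest-scoring (least informative) layers are pruning candidates."""
    hs = model(input_ids, output_hidden_states=True).hidden_states
    scores = []
    for h_in, h_out in zip(hs[:-1], hs[1:]):  # consecutive layer boundaries
        cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
        scores.append(1.0 - cos.mean().item())  # 1 - mean cosine similarity
    return scores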

Key Capabilities

  • Korean CCTV scene understanding: fluent Korean descriptions of people, vehicles, environment, and ongoing actions.
  • Anomaly classification across nine event categories:
    • Fight / violence (fight)
    • Fire / smoke (fire)
    • Traffic accident (traffic_accident)
    • Fall (fall)
    • Safety hazard (safety_hazard)
    • Emergency (emergency)
    • Weather disaster (weather_disaster)
    • Illegal behavior (illegal_behavior)
    • Normal (normal)
  • Severity rating: low / medium / high / critical.
  • Recommended actions: concrete next steps an operator can take immediately.
  • Dual output mode
    • Natural-language mode: human-readable narrative analysis.
    • JSON mode: structured schema output for downstream system integration.
    • The same model switches modes based on a single prompt instruction.

Quick Start

Transformers

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "JDONE-Research/AIOne-Vision-30B"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    # System prompt: "You are a CCTV video analysis expert."
    {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/cctv_clip.mp4", "fps": 1.0},
            # "Analyze this CCTV video and describe the scene."
            {"type": "text", "text": "이 CCTV 영상을 분석하여 장면을 설명하십시오."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

vLLM (recommended for serving)

vllm serve JDONE-Research/AIOne-Vision-30B \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt '{"video": 1}' \
    --max-model-len 16384
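
Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch, assuming the default endpoint on localhost:8000 and a vLLM build that accepts video_url content parts for Qwen2.5-VL-style models (the clip URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JDONE-Research/AIOne-Vision-30B",
    messages=[
        # System prompt: "You are a CCTV video analysis expert."
        {"role": "system", "content": "당신은 CCTV 영상 분석 전문가입니다."},
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://example.com/cctv_clip.mp4"}},
                # "Analyze this CCTV video and describe the scene."
                {"type": "text", "text": "이 CCTV 영상을 분석하여 장면을 설명하십시오."},
            ],
        },
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)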

Prompt Guide

Recommended system prompt:

당신은 CCTV 영상 분석 전문가입니다.
("You are a CCTV video analysis expert.")

1) Natural-language mode

이 CCTV 영상을 분석하여 장면을 설명하십시오.
("Analyze this CCTV video and describe the scene.")

2) Structured (JSON) mode

이 CCTV 영상을 분석하여 다음 JSON 스키마로만 응답하십시오.
("Analyze this CCTV video and respond only with the following JSON schema.")
{
  "event_type": "normal|safety_hazard|traffic_accident|fire|fall|fight|emergency|weather_disaster|illegal_behavior",
  "severity_level": "low|medium|high|critical",
  "scene_description": "...ν•©λ‹ˆλ‹€.",
  "recommended_actions": []
}
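
Since JSON mode typically feeds downstream systems, it is worth gating responses before acting on them. A minimal validation sketch, assuming only the field names and enumerations from the prompt above (the helper name is hypothetical):

import json

EVENT_TYPES = {
    "normal", "safety_hazard", "traffic_accident", "fire", "fall",
    "fight", "emergency", "weather_disaster", "illegal_behavior",
}
SEVERITY_LEVELS = {"low", "medium", "high", "critical"}

def parse_analysis(raw: str) -> dict:
    """Parse one JSON-mode response and reject schema violations."""
    result = json.loads(raw)  # raises ValueError on malformed JSON
    assert result["event_type"] in EVENT_TYPES
    assert result["severity_level"] in SEVERITY_LEVELS
    assert isinstance(result["scene_description"], str)
    assert isinstance(result["recommended_actions"], list)
    return result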

Recommended video input settings

Setting                 Recommended value
FPS                     1.0 (1 frame per second)
Max frames              ~16 (works best for 10-30s clips)
Pixel budget / frame    ~200K pixels (around 448x448)
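
With qwen_vl_utils, these settings map onto per-message keys read by process_vision_info. A sketch mirroring the table above (the file path is a placeholder):

video_message = {
    "type": "video",
    "video": "file:///path/to/cctv_clip.mp4",
    "fps": 1.0,             # sample one frame per second
    "max_pixels": 200_000,  # per-frame pixel budget (~448x448)
}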

Sample Output

Natural-language mode

야간 도심 교차로에서 두 명의 남성이 도로 위에 서서 서로 밀치며 다투는 모습이 확인됩니다. 한 명이 상대를 강하게 밀어 쓰러뜨린 뒤 추가 폭력 행위가 이어지고 있어 즉각적인 개입이 필요해 보입니다.

(Translation: At a downtown intersection at night, two men are seen standing in the roadway, shoving each other in a dispute. One man shoves the other to the ground and further violence follows, so immediate intervention appears necessary.)

JSON mode

{
  "event_type": "fight",
  "severity_level": "high",
  "scene_description": "야간 도심 교차로에서 두 명의 남성이 서로를 밀치며 폭력 행위를 보이고 있습니다.",
  "recommended_actions": [
    "관할 지구대에 즉시 신고",
    "주변 CCTV 추적 카메라로 인물 동선 확보",
    "현장 관제사에게 우선 알림 송출"
  ]
}

(Translation: scene_description: "At a downtown intersection at night, two men are shoving each other in an act of violence." recommended_actions: report immediately to the local police substation; secure the subjects' movement path with nearby CCTV tracking cameras; send a priority alert to the on-site control operator.)
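
For illustration only, a hypothetical downstream hook (not part of the model) showing how these fields might drive an alerting pipeline; the tier names and thresholds are assumptions:

def route_event(analysis: dict) -> str:
    """Map a validated JSON-mode response to an alert tier."""
    if analysis["event_type"] == "normal":
        return "log_only"
    if analysis["severity_level"] in {"high", "critical"}:
        return "page_operator"    # surface immediately, with recommended actions
    return "queue_for_review"     # low/medium anomalies go to a review queue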

Model Specs

Field             Value
Architecture      Qwen2_5_VLForConditionalGeneration
Base model        Qwen/Qwen2.5-VL-32B-Instruct
Parameters        ~30.53B (pruned from 32B; 6 decoder layers removed)
Decoder layers    58 (original: 64)
Modality          Video + Image + Text → Text
Precision         bfloat16
Context length    128K tokens
Languages         Korean (primary), English

Intended Use

  • CCTV control-room assistance (surfacing anomaly-event candidates).
  • Natural-language search and summarization over video archives.
  • Event labeling for security and safety dashboards.
  • Research on multimodal video understanding.

Out-of-Scope Use

  • Sole-source decision-making with legal consequences (identification, arrest, sanctions, etc.).
  • Automated use of force or coercive control based purely on this model's output.
  • Any video analysis that infringes on personal privacy, image rights, or applicable data-protection laws.

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

  • Non-commercial use, redistribution, and modification are permitted with attribution.
  • Commercial use requires a separate agreement.
  • Users must also comply with the license terms of the base model, Qwen2.5-VL-32B-Instruct.

Citation

@misc{aione_vision,
  title  = {AIOne-Vision-30B: A Korean CCTV Video Understanding VLM},
  author = {JDONE Research},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Vision-30B}}
}