EEVE-VSS-SMH-BNB-8bit

8-bit Quantized Version (Production-Ready) | 8-bit ์–‘์žํ™” ๋ฒ„์ „ (ํ”„๋กœ๋•์…˜์šฉ)


English

Model Description

This model is a BitsAndBytes 8-bit quantized version of MyeongHo0621/eeve-vss-smh, optimized for production deployment.

Key Features

  • โœ… Production-Ready: Near-FP16 quality with 50% memory reduction
  • โœ… 8-bit Quantization: Minimal quality loss (<0.5%)
  • โœ… High Stability: More stable than 4-bit for production services
  • โœ… Optimal Balance: Best quality-performance trade-off

Quick Start

Installation

pip install transformers torch bitsandbytes accelerate

Required: bitsandbytes library is mandatory!

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 8-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# Generate
user_input = "Explain quantum computing"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Alternative: Using torch.dtype Directly

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with explicit dtype
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

Simplified Method (Auto-load quantization config)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Automatically loads saved quantization settings
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

System Requirements

Minimum Specifications

Component Minimum Recommended
GPU RTX 3060 (12GB) RTX 4090 (24GB)
VRAM 10GB 12GB+
RAM 16GB 32GB+
CUDA 11.0+ 12.0+

Tested Environments

  • โœ… RTX 3060 (12GB VRAM) - Works well
  • โœ… RTX 3090 (24GB VRAM) - Excellent
  • โœ… RTX 4090 (24GB VRAM) - Perfect
  • โœ… H100 (80GB VRAM) - Overkill but excellent

Quantization Details

BitsAndBytes 8-bit

Quantization Type: INT8
Bits: 8-bit
Outlier Threshold: 6.0
Method: LLM.int8() with outlier detection
Quality: 99.5% of FP16

Performance Comparison

Version Model Size VRAM Usage Quality Loss Inference Speed Production
FP16 Original ~21GB ~21GB 0% โšกโšกโšกโšก โญโญโญโญโญ
BNB 8-bit ~10.5GB ~10GB <0.5% โšกโšกโšกโšก โญโญโญโญโญ
BNB 4-bit ~5.5GB ~3.5GB 1-2% โšกโšกโšก โญโญโญ

Recommended Generation Parameters

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}

Parameter Guide by Use Case

Use Case Temperature Top P Notes
Factual Answers 0.1-0.3 0.8-0.9 Fact-based questions
Balanced 0.5-0.7 0.85-0.95 General usage
Creative 0.8-1.0 0.9-1.0 Stories, poems

Production Deployment

Why 8-bit for Production?

  • Quality: <0.5% loss compared to FP16 (vs 1-2% for 4-bit)
  • Stability: More consistent outputs
  • Cost-Effective: 50% memory reduction vs FP16
  • Battle-Tested: LLM.int8() algorithm widely used in production

Deployment Architecture

Load Balancer
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Server 1  โ”‚   Server 2  โ”‚   Server 3  โ”‚
โ”‚   RTX 4090  โ”‚   RTX 4090  โ”‚   RTX 4090  โ”‚
โ”‚   8-bit     โ”‚   8-bit     โ”‚   8-bit     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Cost: ~60% of FP16 deployment
Quality: 99.5% of FP16

Example Outputs

Korean Response Quality

Input:

WMS ์‹œ์Šคํ…œ์˜ ํ•ต์‹ฌ ๊ธฐ๋Šฅ 3๊ฐ€์ง€๋ฅผ ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”

Output:

WMS(Warehouse Management System) ์‹œ์Šคํ…œ์˜ ํ•ต์‹ฌ ๊ธฐ๋Šฅ 3๊ฐ€์ง€๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

1. ์žฌ๊ณ  ๊ด€๋ฆฌ (Inventory Management)
   - ์‹ค์‹œ๊ฐ„ ์žฌ๊ณ  ์ถ”์  ๋ฐ ๊ฐ€์‹œ์„ฑ ์ œ๊ณต
   - ์ž…์ถœ๊ณ  ์ž๋™ํ™” ๋ฐ ์ •ํ™•๋„ ํ–ฅ์ƒ
   - ์žฌ๊ณ  ํšŒ์ „์œจ ์ตœ์ ํ™”

2. ์ฃผ๋ฌธ ์ฒ˜๋ฆฌ (Order Fulfillment)
   - ํ”ผํ‚น, ํŒจํ‚น, ๋ฐฐ์†ก ํ”„๋กœ์„ธ์Šค ์ž๋™ํ™”
   - ์ฃผ๋ฌธ ์šฐ์„ ์ˆœ์œ„ ๊ด€๋ฆฌ
   - ๋ฐฐ์†ก ์ •ํ™•๋„ ํ–ฅ์ƒ

3. ์ฐฝ๊ณ  ์ตœ์ ํ™” (Warehouse Optimization)
   - ๊ณต๊ฐ„ ํ™œ์šฉ ๊ทน๋Œ€ํ™”
   - ๋™์„  ์ตœ์ ํ™”
   - ์ž‘์—… ์ƒ์‚ฐ์„ฑ ํ–ฅ์ƒ

์ด๋Ÿฌํ•œ ๊ธฐ๋Šฅ๋“ค์„ ํ†ตํ•ด ๋ฌผ๋ฅ˜ ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Original Model Information

This is a quantized version of:

For detailed training process, see original model page.

Troubleshooting

CUDA Out of Memory

# Reduce max_new_tokens
generation_config = {
    "max_new_tokens": 256,  # 512 โ†’ 256
    ...
}

bitsandbytes Installation Error

# Check CUDA version
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

Use Cases

โœ… Ideal For

  • Production deployments
  • API services with SLA requirements
  • High-throughput applications
  • Cost-sensitive deployments
  • Quality-critical applications

โš ๏ธ Consider Alternatives If

Limitations

  • Requires ~10GB VRAM (vs 3.5GB for 4-bit)
  • <0.5% quality loss compared to FP16
  • Requires bitsandbytes library
  • Windows may require additional setup

License

Citation

@misc{eeve-vss-smh-bnb-8bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-8bit: 8-bit Quantized Korean Model for Production},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-8bit}},
  note = {8-bit quantized version using BitsAndBytes LLM.int8()}
}

Acknowledgments

Related Models

Model Size VRAM Quality Use Case
eeve-vss-smh 21GB 21GB 100% High-end GPUs
eeve-vss-smh-bnb-8bit 10.5GB 10GB 99.5% Production โญ
eeve-vss-smh-bnb-4bit 5.5GB 3.5GB 98% Low-VRAM

Contact


Quantization Date: 2025-10-11
Method: BitsAndBytes LLM.int8()
Status: Production-Ready ๐Ÿš€


ํ•œ๊ตญ์–ด

๋ชจ๋ธ ์†Œ๊ฐœ

์ด ๋ชจ๋ธ์€ MyeongHo0621/eeve-vss-smh๋ฅผ BitsAndBytes 8-bit๋กœ ์–‘์žํ™”ํ•œ ํ”„๋กœ๋•์…˜์šฉ ๋ฒ„์ „์ž…๋‹ˆ๋‹ค.

์ฃผ์š” ํŠน์ง•

  • โœ… ํ”„๋กœ๋•์…˜ ์ตœ์ ํ™”: FP16๊ณผ ๊ฑฐ์˜ ๋™์ผํ•œ ํ’ˆ์งˆ๋กœ ๋ฉ”๋ชจ๋ฆฌ 50% ์ ˆ๊ฐ
  • โœ… 8-bit ์–‘์žํ™”: ํ’ˆ์งˆ ์†์‹ค ์ตœ์†Œ (<0.5%)
  • โœ… ๋†’์€ ์•ˆ์ •์„ฑ: 4-bit๋ณด๋‹ค ํ”„๋กœ๋•์…˜ ์„œ๋น„์Šค์— ์•ˆ์ •์ 
  • โœ… ์ตœ์  ๊ท ํ˜•: ํ’ˆ์งˆ๊ณผ ์„ฑ๋Šฅ์˜ ์ตœ๊ณ  ์กฐํ•ฉ

๋น ๋ฅธ ์‹œ์ž‘

์„ค์น˜

pip install transformers torch bitsandbytes accelerate

ํ•„์ˆ˜: bitsandbytes ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค!

๊ธฐ๋ณธ ์‚ฌ์šฉ

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 8-bit ์„ค์ •
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# ๋ชจ๋ธ ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

# ํ”„๋กฌํ”„ํŠธ ํ…œํ”Œ๋ฆฟ
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# ๋Œ€ํ™”
user_input = "์–‘์ž ์ปดํ“จํŒ…์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

๋Œ€์•ˆ: torch.dtype ์ง์ ‘ ์‚ฌ์šฉ

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# dtype ๋ช…์‹œ์  ์ง€์ •
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ• (์ €์žฅ๋œ ์„ค์ • ์ž๋™ ๋กœ๋“œ)

from transformers import AutoModelForCausalLM, AutoTokenizer

# ์ €์žฅ๋œ ์–‘์žํ™” ์„ค์ •์„ ์ž๋™์œผ๋กœ ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-8bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-8bit")

์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ

์ตœ์†Œ ์‚ฌ์–‘

๊ตฌ์„ฑ ์š”์†Œ ์ตœ์†Œ ์‚ฌ์–‘ ์ถ”์ฒœ ์‚ฌ์–‘
GPU RTX 3060 (12GB) RTX 4090 (24GB)
VRAM 10GB 12GB+
RAM 16GB 32GB+
CUDA 11.0+ 12.0+

ํ…Œ์ŠคํŠธ๋œ ํ™˜๊ฒฝ

  • โœ… RTX 3060 (12GB VRAM) - ์›ํ™œํ•˜๊ฒŒ ์ž‘๋™
  • โœ… RTX 3090 (24GB VRAM) - ํ›Œ๋ฅญํ•จ
  • โœ… RTX 4090 (24GB VRAM) - ์™„๋ฒฝํ•จ
  • โœ… H100 (80GB VRAM) - ์˜ค๋ฒ„ํ‚ฌ์ด์ง€๋งŒ ์™„๋ฒฝ

์–‘์žํ™” ์„ธ๋ถ€์‚ฌํ•ญ

BitsAndBytes 8-bit

Quantization Type: INT8
Bits: 8-bit
Outlier Threshold: 6.0
Method: LLM.int8() with outlier detection
Quality: FP16์˜ 99.5%

์„ฑ๋Šฅ ๋น„๊ต

๋ฒ„์ „ ๋ชจ๋ธ ํฌ๊ธฐ VRAM ์‚ฌ์šฉ ํ’ˆ์งˆ ์†์‹ค ์ถ”๋ก  ์†๋„ ํ”„๋กœ๋•์…˜
FP16 ์›๋ณธ ~21GB ~21GB 0% โšกโšกโšกโšก โญโญโญโญโญ
BNB 8-bit ~10.5GB ~10GB <0.5% โšกโšกโšกโšก โญโญโญโญโญ
BNB 4-bit ~5.5GB ~3.5GB 1-2% โšกโšกโšก โญโญโญ

์ถ”์ฒœ ์ƒ์„ฑ ํŒŒ๋ผ๋ฏธํ„ฐ

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}

์šฉ๋„๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ

์šฉ๋„ Temperature Top P ์„ค๋ช…
์ •ํ™•ํ•œ ๋‹ต๋ณ€ 0.1-0.3 0.8-0.9 ์‚ฌ์‹ค ๊ธฐ๋ฐ˜ ์งˆ๋ฌธ
๊ท ํ˜• ๋‹ต๋ณ€ 0.5-0.7 0.85-0.95 ์ผ๋ฐ˜์  ์‚ฌ์šฉ
์ฐฝ์˜์  ๋‹ต๋ณ€ 0.8-1.0 0.9-1.0 ์Šคํ† ๋ฆฌ, ์‹œ ๋“ฑ

ํ”„๋กœ๋•์…˜ ๋ฐฐํฌ

ํ”„๋กœ๋•์…˜์— 8-bit๋ฅผ ์„ ํƒํ•˜๋Š” ์ด์œ ?

  • ํ’ˆ์งˆ: FP16 ๋Œ€๋น„ <0.5% ์†์‹ค (4-bit๋Š” 1-2%)
  • ์•ˆ์ •์„ฑ: ๋” ์ผ๊ด€๋œ ์ถœ๋ ฅ
  • ๋น„์šฉ ํšจ์œจ: FP16 ๋Œ€๋น„ 50% ๋ฉ”๋ชจ๋ฆฌ ์ ˆ๊ฐ
  • ๊ฒ€์ฆ๋œ ๊ธฐ์ˆ : LLM.int8() ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ”„๋กœ๋•์…˜์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋จ

๋ฐฐํฌ ์•„ํ‚คํ…์ฒ˜

๋กœ๋“œ ๋ฐธ๋Ÿฐ์„œ
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   ์„œ๋ฒ„ 1    โ”‚   ์„œ๋ฒ„ 2    โ”‚   ์„œ๋ฒ„ 3    โ”‚
โ”‚   RTX 4090  โ”‚   RTX 4090  โ”‚   RTX 4090  โ”‚
โ”‚   8-bit     โ”‚   8-bit     โ”‚   8-bit     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๋น„์šฉ: FP16 ๋ฐฐํฌ์˜ ~60%
ํ’ˆ์งˆ: FP16์˜ 99.5%

์„ฑ๋Šฅ ์˜ˆ์‹œ

ํ•œ๊ตญ์–ด ์‘๋‹ต ํ’ˆ์งˆ

์ž…๋ ฅ:

WMS ์‹œ์Šคํ…œ์˜ ํ•ต์‹ฌ ๊ธฐ๋Šฅ 3๊ฐ€์ง€๋ฅผ ์„ค๋ช…ํ•ด์ฃผ์„ธ์š”

์ถœ๋ ฅ:

WMS(Warehouse Management System) ์‹œ์Šคํ…œ์˜ ํ•ต์‹ฌ ๊ธฐ๋Šฅ 3๊ฐ€์ง€๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

1. ์žฌ๊ณ  ๊ด€๋ฆฌ (Inventory Management)
   - ์‹ค์‹œ๊ฐ„ ์žฌ๊ณ  ์ถ”์  ๋ฐ ๊ฐ€์‹œ์„ฑ ์ œ๊ณต
   - ์ž…์ถœ๊ณ  ์ž๋™ํ™” ๋ฐ ์ •ํ™•๋„ ํ–ฅ์ƒ
   - ์žฌ๊ณ  ํšŒ์ „์œจ ์ตœ์ ํ™”

2. ์ฃผ๋ฌธ ์ฒ˜๋ฆฌ (Order Fulfillment)
   - ํ”ผํ‚น, ํŒจํ‚น, ๋ฐฐ์†ก ํ”„๋กœ์„ธ์Šค ์ž๋™ํ™”
   - ์ฃผ๋ฌธ ์šฐ์„ ์ˆœ์œ„ ๊ด€๋ฆฌ
   - ๋ฐฐ์†ก ์ •ํ™•๋„ ํ–ฅ์ƒ

3. ์ฐฝ๊ณ  ์ตœ์ ํ™” (Warehouse Optimization)
   - ๊ณต๊ฐ„ ํ™œ์šฉ ๊ทน๋Œ€ํ™”
   - ๋™์„  ์ตœ์ ํ™”
   - ์ž‘์—… ์ƒ์‚ฐ์„ฑ ํ–ฅ์ƒ

์ด๋Ÿฌํ•œ ๊ธฐ๋Šฅ๋“ค์„ ํ†ตํ•ด ๋ฌผ๋ฅ˜ ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์›๋ณธ ๋ชจ๋ธ ์ •๋ณด

์ด ๋ชจ๋ธ์€ ๋‹ค์Œ ๋ชจ๋ธ์˜ ์–‘์žํ™” ๋ฒ„์ „์ž…๋‹ˆ๋‹ค:

  • ์›๋ณธ ๋ชจ๋ธ: MyeongHo0621/eeve-vss-smh
  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ: 100K+ ๊ณ ํ’ˆ์งˆ ํ•œ๊ตญ์–ด instruction ๋ฐ์ดํ„ฐ
  • LoRA ์„ค์ •: r=64, alpha=128, dropout=0.05

์ž์„ธํ•œ ํ›ˆ๋ จ ๊ณผ์ •์€ ์›๋ณธ ๋ชจ๋ธ ํŽ˜์ด์ง€๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

๋ฌธ์ œ ํ•ด๊ฒฐ

CUDA Out of Memory

# max_new_tokens ์ค„์ด๊ธฐ
generation_config = {
    "max_new_tokens": 256,  # 512 โ†’ 256
    ...
}

bitsandbytes ์„ค์น˜ ์˜ค๋ฅ˜

# CUDA ๋ฒ„์ „ ํ™•์ธ
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

์‚ฌ์šฉ ์‚ฌ๋ก€

โœ… ์ ํ•ฉํ•œ ๊ฒฝ์šฐ

  • ํ”„๋กœ๋•์…˜ ๋ฐฐํฌ
  • SLA ์š”๊ตฌ์‚ฌํ•ญ์ด ์žˆ๋Š” API ์„œ๋น„์Šค
  • ๋†’์€ ์ฒ˜๋ฆฌ๋Ÿ‰ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜
  • ๋น„์šฉ์— ๋ฏผ๊ฐํ•œ ๋ฐฐํฌ
  • ํ’ˆ์งˆ์ด ์ค‘์š”ํ•œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜

โš ๏ธ ๋Œ€์•ˆ ๊ณ ๋ ค ์‚ฌํ•ญ

์ œํ•œ์‚ฌํ•ญ

  • ~10GB VRAM ํ•„์š” (4-bit๋Š” 3.5GB)
  • FP16 ๋Œ€๋น„ <0.5% ํ’ˆ์งˆ ์†์‹ค
  • bitsandbytes ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ํ•„์ˆ˜
  • Windows์—์„œ ์ถ”๊ฐ€ ์„ค์ • ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Œ

๋ผ์ด์„ ์Šค

  • ๋ชจ๋ธ ๋ผ์ด์„ ์Šค: CC-BY-NC-SA-4.0
  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: EEVE-Korean-Instruct-10.8B-v1.0
  • ์ƒ์—…์  ์‚ฌ์šฉ: ์ œํ•œ์  (๋ผ์ด์„ ์Šค ์ฐธ์กฐ)

Citation

@misc{eeve-vss-smh-bnb-8bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-8bit: 8-bit Quantized Korean Model for Production},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-8bit}},
  note = {8-bit quantized version using BitsAndBytes LLM.int8()}
}

Acknowledgments

  • ์›๋ณธ ๋ชจ๋ธ: MyeongHo0621/eeve-vss-smh
  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: Yanolja EEVE
  • ์–‘์žํ™” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ: BitsAndBytes
  • ํ”„๋ ˆ์ž„์›Œํฌ: Hugging Face Transformers

๊ด€๋ จ ๋ชจ๋ธ

๋ชจ๋ธ ํฌ๊ธฐ VRAM ํ’ˆ์งˆ ์šฉ๋„
eeve-vss-smh 21GB 21GB 100% ๊ณ ์‚ฌ์–‘ GPU
eeve-vss-smh-bnb-8bit 10.5GB 10GB 99.5% ํ”„๋กœ๋•์…˜ โญ
eeve-vss-smh-bnb-4bit 5.5GB 3.5GB 98% ์ €VRAM

Contact


์–‘์žํ™” ์ผ์ž: 2025-10-11
๋ฐฉ๋ฒ•: BitsAndBytes LLM.int8()
์ƒํƒœ: ํ”„๋กœ๋•์…˜ ์ค€๋น„ ์™„๋ฃŒ ๐Ÿš€

Downloads last month
3
Safetensors
Model size
11B params
Tensor type
F32
ยท
F16
ยท
I8
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for MyeongHo0621/eeve-vss-smh-bnb-8bit