|
--- |
|
language: |
|
- ko |
|
license: apache-2.0 |
|
tags: |
|
- text2sql |
|
- spider |
|
- korean |
|
- llama |
|
- text-generation |
|
- table-question-answering |
|
datasets: |
|
- spider |
|
- huggingface-KREW/spider-ko |
|
base_model: unsloth/Meta-Llama-3.1-8B-Instruct |
|
model-index: |
|
- name: Llama-3.1-8B-Spider-SQL-Ko |
|
results: |
|
- task: |
|
type: text2sql |
|
name: Text to SQL |
|
dataset: |
|
name: Spider (Korean) |
|
type: text2sql |
|
metrics: |
|
- type: exact_match |
|
value: 42.65 |
|
- type: execution_accuracy |
|
value: 65.47 |
|
--- |
|
|
|
# Llama-3.1-8B-Spider-SQL-Ko |
|
|
|
ํ๊ตญ์ด ์ง๋ฌธ์ SQL ์ฟผ๋ฆฌ๋ก ๋ณํํ๋ Text-to-SQL ๋ชจ๋ธ์
๋๋ค. |
|
[Spider](https://yale-lily.github.io/spider) train ๋ฐ์ดํฐ์
์ ํ๊ตญ์ด๋ก ๋ฒ์ญํ [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko) ๋ฐ์ดํฐ์
์ ํ์ฉํ์ฌ ๋ฏธ์ธ์กฐ์ ํ์์ต๋๋ค. |
|
|
|
## ๐ ์ฃผ์ ์ฑ๋ฅ |
|
|
|
Spider ํ๊ตญ์ด ๊ฒ์ฆ ๋ฐ์ดํฐ์
(1,034๊ฐ) ํ๊ฐ ๊ฒฐ๊ณผ: |
|
- **์ ํ ์ผ์น์จ**: 42.65% (441/1034) |
|
- **์คํ ์ ํ๋**: 65.47% (677/1034) |
|
|
|
> ๐ก ์คํ ์ ํ๋๊ฐ ์ ํ ์ผ์น์จ๋ณด๋ค ๋์ ์ด์ ๋, SQL ๋ฌธ๋ฒ์ด ๋ค๋ฅด๋๋ผ๋ ๋์ผํ ๊ฒฐ๊ณผ๋ฅผ ๋ฐํํ๋ ๊ฒฝ์ฐ๊ฐ ๋ง๊ธฐ ๋๋ฌธ์
๋๋ค. |
|
|
|
## ๐ ๋ฐ๋ก ์์ํ๊ธฐ |
|
|
|
```python |
|
from unsloth import FastLanguageModel |
|
|
|
# ๋ชจ๋ธ ๋ถ๋ฌ์ค๊ธฐ |
|
model, tokenizer = FastLanguageModel.from_pretrained( |
|
model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko", |
|
max_seq_length=2048, |
|
dtype=None, |
|
load_in_4bit=True, |
|
) |
|
|
|
# ํ๊ตญ์ด ์ง๋ฌธ โ SQL ๋ณํ |
|
question = "๊ฐ์๋ ๋ช ๋ช
์ด ์๋์?" |
|
schema = """ํ
์ด๋ธ: singer |
|
์ปฌ๋ผ: singer_id, name, country, age""" |
|
|
|
prompt = f"""๋ฐ์ดํฐ๋ฒ ์ด์ค ์คํค๋ง: |
|
{schema} |
|
|
|
์ง๋ฌธ: {question} |
|
SQL:""" |
|
|
|
# ๊ฒฐ๊ณผ: SELECT count(*) FROM singer |
|
``` |
|
|
|
## ๐ ๋ชจ๋ธ ์๊ฐ |
|
|
|
- **๊ธฐ๋ฐ ๋ชจ๋ธ**: Llama 3.1 8B Instruct (4bit ์์ํ) |
|
- **ํ์ต ๋ฐ์ดํฐ**: [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko) (1-epoch) |
|
- **์ง์ DB**: 166๊ฐ์ ๋ค์ํ ๋๋ฉ์ธ ๋ฐ์ดํฐ๋ฒ ์ด์ค ( [spider dataset](https://yale-lily.github.io/spider) ) |
|
- **ํ์ต ๋ฐฉ๋ฒ**: LoRA (r=16, alpha=32) |
|
|
|
## ๐ฌ ํ์ฉ ์์ |
|
|
|
### ๊ธฐ๋ณธ ์ฌ์ฉ๋ฒ |
|
|
|
```python |
|
def generate_sql(question, schema_info): |
|
"""ํ๊ตญ์ด ์ง๋ฌธ์ SQL๋ก ๋ณํ""" |
|
prompt = f"""๋ค์ ๋ฐ์ดํฐ๋ฒ ์ด์ค ์คํค๋ง๋ฅผ ์ฐธ๊ณ ํ์ฌ ์ง๋ฌธ์ ๋ํ SQL ์ฟผ๋ฆฌ๋ฅผ ์์ฑํ์ธ์. |
|
|
|
### ๋ฐ์ดํฐ๋ฒ ์ด์ค ์คํค๋ง: |
|
{schema_info} |
|
|
|
### ์ง๋ฌธ: {question} |
|
|
|
### SQL ์ฟผ๋ฆฌ:""" |
|
|
|
messages = [{"role": "user", "content": prompt}] |
|
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") |
|
|
|
outputs = model.generate(inputs, max_new_tokens=150, temperature=0.1) |
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
|
return response.split("### SQL ์ฟผ๋ฆฌ:")[-1].strip() |
|
``` |
|
|
|
### ์ค์ ์ฌ์ฉ ์์ |
|
|
|
```python |
|
# ์์ 1: ์ง๊ณ ํจ์ |
|
question = "๋ถ์์ฅ๋ค ์ค 56์ธ๋ณด๋ค ๋์ด๊ฐ ๋ง์ ์ฌ๋์ด ๋ช ๋ช
์
๋๊น?" |
|
# ๊ฒฐ๊ณผ: SELECT count(*) FROM head WHERE age > 56 |
|
|
|
# ์์ 2: ์กฐ์ธ |
|
question = "๊ฐ์ฅ ๋ง์ ๋ํ๋ฅผ ๊ฐ์ตํ ๋์์ ์ํ๋ ๋ฌด์์ธ๊ฐ์?" |
|
# ๊ฒฐ๊ณผ: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1 |
|
|
|
# ์์ 3: ์๋ธ์ฟผ๋ฆฌ |
|
question = "๊ธฐ์
๊ฐ๊ฐ ์๋ ์ฌ๋๋ค์ ์ด๋ฆ์ ๋ฌด์์
๋๊น?" |
|
# ๊ฒฐ๊ณผ: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur) |
|
``` |
|
|
|
## โ ๏ธ ์ฌ์ฉ ์ ์ฃผ์์ฌํญ |
|
|
|
### ์ ํ์ฌํญ |
|
- โ
์์ด ํ
์ด๋ธ/์ปฌ๋ผ๋ช
์ฌ์ฉ (ํ๊ตญ์ด ์ง๋ฌธ โ ์์ด SQL) |
|
- โ
Spider ๋ฐ์ดํฐ์
๋๋ฉ์ธ์ ์ต์ ํ |
|
- โ NoSQL, ๊ทธ๋ํ DB ๋ฏธ์ง์ |
|
- โ ๋งค์ฐ ๋ณต์กํ ์ค์ฒฉ ์ฟผ๋ฆฌ๋ ์ ํ๋ ํ๋ฝ |
|
|
|
## ๐ง ๊ธฐ์ ์ฌ์ |
|
|
|
### ํ์ต ํ๊ฒฝ |
|
- **GPU**: NVIDIA Tesla T4 (16GB) |
|
- **ํ์ต ์๊ฐ**: ์ฝ 4์๊ฐ |
|
- **๋ฉ๋ชจ๋ฆฌ ์ฌ์ฉ**: ์ต๋ 7.6GB VRAM |
|
|
|
### ํ์ดํผํ๋ผ๋ฏธํฐ |
|
```python |
|
training_args = { |
|
"per_device_train_batch_size": 2, |
|
"gradient_accumulation_steps": 4, |
|
"learning_rate": 5e-4, |
|
"num_train_epochs": 1, |
|
"optimizer": "adamw_8bit", |
|
"lr_scheduler_type": "cosine", |
|
"warmup_ratio": 0.05 |
|
} |
|
|
|
lora_config = { |
|
"r": 16, |
|
"lora_alpha": 32, |
|
"lora_dropout": 0, |
|
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", |
|
"gate_proj", "up_proj", "down_proj"] |
|
} |
|
``` |
|
|
|
## ๐ ์ฐธ๊ณ ์๋ฃ |
|
|
|
### ์ธ์ฉ |
|
```bibtex |
|
@misc{llama31_spider_sql_ko_2025, |
|
title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model}, |
|
author={[Sohyun Sim, Youngjun Cho, Seongwoo Choi]}, |
|
year={2025}, |
|
publisher={Hugging Face KREW}, |
|
url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko} |
|
} |
|
``` |
|
|
|
### ๊ด๋ จ ๋
ผ๋ฌธ |
|
- [Spider: A Large-Scale Human-Labeled Dataset](https://arxiv.org/abs/1809.08887) (Yu et al., 2018) |
|
|
|
## ๐ค ๊ธฐ์ฌ์ |
|
|
|
[@sim-so](https://huggingface.co/sim-so), [@choincnp](https://huggingface.co/choincnp), [@nuatmochoi](https://huggingface.co/nuatmochoi) |
|
|