# SimToken Setup

---

## 1. Create Environment

```bash
conda create -n simtoken python=3.10 -y
conda activate simtoken

python -m pip install --upgrade pip wheel "setuptools<81"

pip install \
    torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 \
    --index-url https://download.pytorch.org/whl/cu121

pip install \
    transformers==4.30.2 \
    peft==0.2.0 \
    accelerate==0.21.0 \
    sentencepiece \
    protobuf \
    safetensors \
    numpy==1.26.4 \
    pandas \
    matplotlib \
    opencv-python \
    pillow \
    tqdm \
    einops \
    timm \
    requests \
    towhee \
    huggingface_hub
```

---

## 2. Download from HuggingFace (New Machine Setup)

Log in to HuggingFace (generate a token at https://huggingface.co/settings/tokens):

```bash
huggingface-cli login
```

Download the full repo (code, weights, and compressed data archives; about 190 GB in total):

```bash
mkdir -p /workspace/SimToken
cd /workspace/SimToken

huggingface-cli download yfan07/SimToken \
    --repo-type model \
    --local-dir . \
    --local-dir-use-symlinks False
```
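
Because the repo is about 190 GB and the archives are unpacked afterwards, it may be worth checking free disk space before starting the download. The following is a minimal sketch, assuming a POSIX `df`. The helper name `has_free_gb` and the 400 GB threshold are illustrative assumptions (a rough guess at archives plus extracted copies), not figures from the project:

```bash
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# MIN_GB gigabytes available. Uses POSIX `df -Pk` (sizes in 1 KiB blocks).
has_free_gb() {
    local dir="$1" min_gb="$2" avail_kb
    avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')
    # An empty value means df could not inspect the directory.
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Assumed threshold: ~190 GB download plus room for the extracted data.
if has_free_gb /workspace 400; then
    echo "enough free space"
else
    echo "less than 400 GB free under /workspace (or path not found)" >&2
fi
```

Adjust the threshold to the actual size of the extracted data on your machine.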

After the download finishes, extract the data archives:

```bash
cd /workspace/SimToken/data

tar -xf  image_embed.tar      # takes ~5–10 minutes
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf  media.tar
```
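
---

## 3. Pre-download Model Weights (Required Before First Use)

`transformers==4.30.2` is API-incompatible with recent versions of `huggingface_hub` (the `use_auth_token` argument has been removed).
Workaround: download the models into the local cache with the CLI first, then set `TRANSFORMERS_OFFLINE=1` when running experiments so that all network requests are skipped.
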
```bash
# Chat-UniVi-7B (~14 GB)
huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5

# CLIP ViT-L (~1.6 GB)
huggingface-cli download openai/clip-vit-large-patch14
```
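
To confirm that both downloads landed in the local cache before going offline, one option is to look for the cache folders directly. The following is a minimal sketch, assuming the default `huggingface_hub` cache layout (`models--<org>--<name>` under `$HF_HUB_CACHE`, else `$HF_HOME/hub`, else `~/.cache/huggingface/hub`). The helper name `hf_cache_path` is hypothetical:

```bash
# Hypothetical helper: print the cache folder a model repo id maps to,
# assuming the default huggingface_hub layout (models--<org>--<name>).
hf_cache_path() {
    local repo_id="$1"
    local hub_dir="${HF_HUB_CACHE:-${HF_HOME:-$HOME/.cache/huggingface}/hub}"
    printf '%s/models--%s\n' "$hub_dir" "$(printf '%s' "$repo_id" | sed 's|/|--|g')"
}

# Report whether each model from the commands above has a snapshots folder.
for repo in Chat-UniVi/Chat-UniVi-7B-v1.5 openai/clip-vit-large-patch14; do
    path=$(hf_cache_path "$repo")
    if [ -d "$path/snapshots" ]; then
        echo "cached:  $repo"
    else
        echo "missing: $repo (expected at $path)"
    fi
done
```

If a model is reported missing even though the download succeeded, the cache is probably in a non-default location set by another environment variable.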

Once downloaded, the models stay in the cache permanently, so new sessions do not need to download them again.

---

## 4. Example Evaluation

Prefix every evaluation command with `TRANSFORMERS_OFFLINE=1`:

```bash
cd /workspace/SimToken

# Unseen split (full set, 1656 samples)
TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_u

# Seen split
TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_s

# Null split (S metric, lower is better)
TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_n

# Limit the number of samples (quick sanity check)
TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_u --max_eval_rows 50

# Stage 0 gradient-connectivity and bypass-equivalence checks (diagnostics only)
TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split test_u --max_eval_rows 0
```
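
The three full-split runs can also be chained in one loop. The following is a minimal sketch that reuses only the flags shown above. The `logs/` directory, the log file names, and the `DRY_RUN` switch are illustrative assumptions, not part of the project:

```bash
# Hypothetical wrapper: run one evaluation split and keep a copy of its output.
# With DRY_RUN=1 the command is printed instead of executed.
run_split() {
    local split="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split $split"
        return 0
    fi
    mkdir -p logs
    TRANSFORMERS_OFFLINE=1 python -W ignore load_model.py --eval_split "$split" \
        2>&1 | tee "logs/eval_${split}.log"
}

# Print the three commands without running them; drop DRY_RUN=1 to execute.
for split in test_u test_s test_n; do
    DRY_RUN=1 run_split "$split"
done
```

When executing for real, run the loop from `/workspace/SimToken` so that `load_model.py` is found.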
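
Each evaluation run prints, in order, two sets of results (Baseline and q-LTPO Stage 1) along with diagnostic statistics.

---

## 5. Upload to HuggingFace (After Experiments)

The data directories are stored as archives, which greatly reduces the file count and avoids HuggingFace commit rate limits.

**Step 1: Pack the data directories into archives (if not already done)**
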
```bash
cd /workspace/SimToken/data

tar -cf  image_embed.tar    image_embed/     # no compression (.pt files are already binary)
tar -czf gt_mask.tar.gz     gt_mask/
tar -czf audio_embed.tar.gz audio_embed/
tar -cf  media.tar          media/

# Confirm the archives exist, then delete the original directories
ls -lh *.tar*
rm -rf image_embed/ gt_mask/ audio_embed/ media/
```
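
Listing the archives with `ls` confirms that they exist, but not that they are readable. A slightly more defensive variant, sketched here assuming GNU or BSD `tar`, deletes a directory only after `tar -tf` can read its archive back. The helper name `pack_and_remove` and its optional third argument for extra `tar` flags are hypothetical:

```bash
# Hypothetical helper: pack DIR into ARCHIVE, then remove DIR only if the
# archive can be listed end to end. Extra tar flags (e.g. -z) go in $3.
pack_and_remove() {
    local dir="$1" archive="$2" flags="${3:-}"
    [ -d "$dir" ] || { echo "skip: $dir not found" >&2; return 1; }
    tar -c $flags -f "$archive" "$dir" || return 1
    if tar -tf "$archive" > /dev/null; then
        rm -rf "$dir"
    else
        echo "archive check failed, keeping $dir" >&2
        return 1
    fi
}

# Self-contained demo on a throwaway directory (not the real data).
demo=$(mktemp -d)
( cd "$demo" && mkdir gt_mask && echo x > gt_mask/a.txt \
    && pack_and_remove gt_mask gt_mask.tar.gz -z \
    && ls )
rm -rf "$demo"
```

On the real data this would be called once per directory, for example `pack_and_remove gt_mask gt_mask.tar.gz -z` from inside `/workspace/SimToken/data`.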

**Step 2: Clean caches and upload**

```bash
# Remove Python bytecode caches; -prune stops find from descending into deleted dirs
find /workspace/SimToken -type d -name "__pycache__" -prune -exec rm -rf {} +
find /workspace/SimToken -name "*.pyc" -delete

huggingface-cli login   # generate a token at https://huggingface.co/settings/tokens (Write permission required)

cd /workspace/SimToken
python upload_hf.py --repo yfan07/SimToken
```
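
**Notes:**
- Run the upload inside `tmux` so that a dropped SSH connection does not interrupt it: start a session with `tmux new -s upload`, then press `Ctrl+B D` to detach
- Uploads are resumable: re-running the same command after an interruption automatically skips files that were already uploaded
- On a rate limit (HTTP 429), the script waits about 1 hour and then retries automatically
- Monitor progress: `tail -f /workspace/SimToken/upload.log`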