
Building on HF


philipp-zettl


https://blog.godesteem.de

philsupertramp

AI & ML interests

NLP/CV/Multimodal learning

Recent Activity

replied to their post about 22 hours ago

I've been cooking something neat over the past weeks 👨‍🍳

We all know that training LLMs requires a lot of resources, especially a lot of compute in the form of GPUs, or is super slow and inefficient when done on CPUs. The big players use giant clusters of Nvidia H100s. But if I look at the profiles of my fellow home brewers, all we can get our hands on are those pesky consumer RTXs. If you're lucky, you got yourself a 5080 with 16 GB VRAM or something.

To be frank, I don't have that 1.3k in disposable cash lying around ¯\_(ツ)_/¯

But I can write Rust and like building ML libraries.

So I asked myself the question(s):
- Can I train SLMs at home on my hardware?
- How hard can it be to build an ML library that can stream data between RAM and VRAM on demand, like llama.cpp's unified memory feature [^1]?
- How hard can it be to implement bf16 support?

The answers are wild, trust me!

Image 1: Metrics from last night's build on my "tiny" RTX 2060 (6 GB VRAM)

Image 2: Metrics from my most recent build on my RTX 4070 Laptop (8 GB VRAM)

The majority of my time went into the shared memory, but it's stable and I'm very excited!

Here are some debug logs, à la "trust me bro":

```
----
Currently available: 1112735744, attempting to reclaim: 1073741824
--- VRAM STATE [backward pass] ---
Driver Used: 6744 MB / 7805 MB
Data on GPU: 1641 MB
Grads on GPU: 3459 MB
CPU Offloaded: 18230 MB
---------------------------------
Currently available: 1079181312, attempting to reclaim: 1073741824
--- VRAM STATE [backward pass] ---
Driver Used: 6776 MB / 7805 MB
Data on GPU: 1561 MB
Grads on GPU: 3279 MB
CPU Offloaded: 18590 MB
-----------------------------
```

Final models get exported in `safetensors` format and are compatible with PyTorch and `transformers`, for accessibility.

[^1]: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#unified-memory
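For readers wondering what the bf16 question in the post involves at the lowest level, here is a minimal sketch in Rust. It is not taken from the library described above, whose code is not shown here; the function names `f32_to_bf16` and `bf16_to_f32` are hypothetical. It assumes the common software approach: bf16 is the upper 16 bits of an IEEE 754 `f32`, so conversion is a rounded truncation, and widening back is a plain shift.

```rust
/// Convert an f32 to bf16, returned as raw bits in a u16.
/// Hypothetical helper for illustration, not the post author's API.
/// bf16 keeps the f32 sign bit, all 8 exponent bits and the top 7 mantissa bits.
pub fn f32_to_bf16(value: f32) -> u16 {
    let bits = value.to_bits();
    if value.is_nan() {
        // Force a quiet-NaN bit so truncation cannot turn a NaN into infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Round to nearest, ties to even: add 0x7FFF plus the lowest bit that
    // survives truncation, then drop the lower 16 bits.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// Widen bf16 bits back to f32. This direction is exact:
/// the 16 missing mantissa bits are simply zero.
pub fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}
```

With this sketch, `f32_to_bf16(1.0)` yields `0x3F80`, and `std::f32::consts::PI` round-trips to `3.140625`, which shows the roughly 2 to 3 significant decimal digits bf16 retains. Doing this efficiently on the GPU, and keeping optimizer state numerically stable, is a separate and harder problem that this sketch does not address.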

posted an update about 22 hours ago


upvoted an article 3 days ago

Safetensors is Joining the PyTorch Foundation



philipp-zettl's datasets (16)

philipp-zettl/qwen3-0.6b-german-training

Viewer • Updated 10 days ago • 1.6k • 26

philipp-zettl/inaturalist-bbs

Updated 11 days ago • 59

philipp-zettl/inaturalist-enriched

Viewer • Updated 12 days ago • 1.49M • 147

philipp-zettl/inaturalist-s3-massive

Viewer • Updated 12 days ago • 1.49M • 1.2k

philipp-zettl/pile-of-law_atticus_contracts

Viewer • Updated Dec 29, 2025 • 488k • 73

philipp-zettl/pile-of-law

Updated Dec 29, 2025 • 89

philipp-zettl/DeepJSONEval

Viewer • Updated Dec 13, 2025 • 2.1k • 23

philipp-zettl/MTGEmb-small-embs

Viewer • Updated Nov 7, 2025 • 59.8k • 7

philipp-zettl/my_first_lora_v1-dataset

Viewer • Updated Oct 1, 2025 • 7 • 17

philipp-zettl/NibbleNix-DE

Viewer • Updated Sep 26, 2025 • 4M • 21

philipp-zettl/mtg_cards-2025-04-04

Viewer • Updated Jun 27, 2025 • 29.9k • 188

philipp-zettl/chessPT-data

Viewer • Updated Oct 8, 2024 • 6.88M • 28 • 1

philipp-zettl/amazon_massive_intent-similarity

Viewer • Updated Aug 29, 2024 • 2.57M • 7

philipp-zettl/long-qa

Viewer • Updated Jun 9, 2024 • 1.25k • 9

philipp-zettl/qg-tydiqa_squad2

Viewer • Updated Jun 4, 2024 • 90.2k • 14

philipp-zettl/tydiqa-task_2-english

Viewer • Updated Jun 2, 2024 • 4.14k • 14