
All HF Hub posts

hesamation 
posted an update about 18 hours ago
longer context doesn't guarantee better responses. it can even hurt your llm/agent. a 1M-token context window doesn't automatically make models smarter; it's not about the size, it's how you use it.

here are 4 types of context failure and why each one happens:

1. context poisoning: if a hallucination finds its way into your context, the agent will rely on that false information to make its future moves. for example, if the agent hallucinates the "task description", all of its planning to solve the task will also be corrupt.

2. context distraction: when the context becomes too bloated, the model focuses too much on it rather than coming up with novel ideas or following what it learned during training. as the Gemini 2.5 Pro technical report points out, as the context grows well beyond 100K tokens, "the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans".

3. context confusion: everyone lost it when MCPs became popular; it seemed like AGI had been achieved. I suspected something was wrong, and there was: it's not just about providing tools. bloating the context with tool metadata derails the model from selecting the right one! even if you can fit all your tool metadata in the context, as the number of tools grows, the model gets confused about which one to pick.

4. context clash: if you exchange messages with a model step by step, providing information as you go, chances are you'll get worse performance than if you provide all the useful information at once. once the model's context fills with wrong information, it's more difficult to guide it to embrace the right info. agents pull information from tools, documents, user queries, etc., and some of that information can contradict the rest, which is not good news for agentic applications.
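one common mitigation for context confusion is to filter the tool list before each call instead of dumping every tool's metadata into the context. a minimal sketch, using naive word overlap as a stand-in for the embedding-based relevance filtering used in practice (tool names and descriptions here are hypothetical):

```python
def select_tools(query, tools, top_k=3):
    """Rank tools by word overlap between the query and each tool's
    description, and keep only the top_k most relevant ones."""
    q_words = set(query.lower().split())
    scored = sorted(
        tools,
        key=lambda t: len(q_words & set(t["description"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

# hypothetical tool registry
tools = [
    {"name": "search_web", "description": "search the web for a query"},
    {"name": "read_file", "description": "read a local file from disk"},
    {"name": "send_email", "description": "send an email to a recipient"},
]

# only the best-matching tool's metadata would go into the context
picked = select_tools("search the web for llm papers", tools, top_k=1)
```

the point is the shape of the fix, not the scoring: the model only ever sees a handful of plausibly relevant tools per step.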

check out this article by Drew Breunig for a deeper read: https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html?ref=blog.langchain.com
AdinaY 
posted an update 2 days ago
Qwen3-Coder 💻 agentic code model by Alibaba Qwen team🚀

Qwen/Qwen3-Coder-480B-A35B-Instruct

✨ 480B total, 35B activated MoE
✨ Agentic Coding + Browser Use → Top code model performance
✨ 256K context (up to 1M via YaRN) for repo-scale understanding
andito 
posted an update 1 day ago
Many VLMs claim to process hours of video. But can they follow the story?🤔
Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳

We test three skills that matter for real-world use:
🔎 Localized Retrieval: Find a specific action.
🧩 Information Synthesis: Piece together scattered clues.
🏃 Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).
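scoring a benchmark like this comes down to aggregating per-skill accuracy so you can see exactly where a model breaks. a minimal sketch, assuming a hypothetical record schema (not TimeScope's actual result format):

```python
from collections import defaultdict

def accuracy_by_skill(results):
    """Aggregate accuracy per skill from a list of
    {"skill": str, "correct": bool} records."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["skill"]] += 1
        hits[r["skill"]] += r["correct"]  # bool counts as 0/1
    return {s: hits[s] / totals[s] for s in totals}

# hypothetical per-question results
results = [
    {"skill": "localized_retrieval", "correct": True},
    {"skill": "localized_retrieval", "correct": False},
    {"skill": "information_synthesis", "correct": True},
]
scores = accuracy_by_skill(results)
```

grouping the same records by video duration instead of skill gives the performance-vs-length curves discussed below.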

The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos.
Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking points—now the community can start fixing them.📈

Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.

📖 Blog:
https://huggingface.co/blog/timescope-video-lmm-benchmark
👩‍💻 Leaderboard & Demo: Apollo-LMMs/TimeScope
📊 Dataset: Apollo-LMMs/TimeScope
⚙️ Eval Code: https://github.com/EvolvingLMMs-Lab/lmms-eval
AdinaY 
posted an update 3 days ago
KAT-V1 🔥 a LLM that tackles overthinking by switching between reasoning and direct answers, by Kuaishou.

Kwaipilot/KAT-V1-40B

✨ 40B
✨ Step-SRPO: smarter reasoning control via RL
✨ MTP + Distillation: efficient training, lower cost
mitkox 
posted an update about 19 hours ago
I run Qwen3-Coder 480B locally on my Z8, with a 1-million token context window. It’s the equivalent of parallel-parking a Nimitz-class carrier in a kiddie pool. Thanks to whatever dark pact the llama.cpp, CUDA, and kernel folks signed, hybrid inferencing + VRAM↔RAM offload let me stream the model’s synapses across Xeon, RAM, and four lonely A6000s without summoning either the OOM killer or a small house fire.
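For anyone wanting to try the same trick, the hybrid setup boils down to partial layer offload in llama.cpp. A sketch of the launch command: the flag names are real llama-server options, but the model filename, layer count, and context size are illustrative, not the author's actual invocation.

```shell
# Hybrid CPU/GPU inference with llama.cpp's llama-server.
#   -m             : local GGUF weights (hypothetical filename)
#   -c             : requested context window in tokens
#   -ngl           : layers offloaded to VRAM; the rest stay in system RAM
#   --tensor-split : ratio for spreading offloaded layers across the 4 GPUs
llama-server -m qwen3-coder-480b-q4.gguf -c 1000000 -ngl 40 --tensor-split 1,1,1,1
```

Tuning `-ngl` is the whole game: too high and the OOM killer shows up, too low and the Xeon does all the work.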
prithivMLmods 
posted an update about 23 hours ago
olmOCR [Allen AI] just got an upgrade! 📈🧑‍🍳

allenai/olmOCR-7B-0725, fine-tuned with allenai/olmOCR-mix-0225 on top of Qwen/Qwen2.5-VL-7B-Instruct, pushes the boundaries of OCR technology. It takes a single document image as input, with the longest side resized to 1288 pixels. It's a high-quality, openly available approach to optical character recognition for parsing PDFs and other complex documents.
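The longest-side resize the model card describes is simple aspect-ratio-preserving scaling. A sketch of the dimension math (not olmOCR's actual preprocessing code):

```python
def longest_side_dims(width, height, target=1288):
    """Output size when the longest side is scaled to `target` pixels,
    preserving aspect ratio."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

# a 2576x1000 page scales down by a factor of 0.5
dims = longest_side_dims(2576, 1000)
```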

Try the demo here: prithivMLmods/Multimodal-OCR

✨ Model: allenai/olmOCR-7B-0725
✨ Model [fp8]: allenai/olmOCR-7B-0725-FP8
✨ Multimodal Implementations Space Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

To learn more, visit the model card of the respective model.
danielhanchen 
posted an update 1 day ago
wenhuach 
posted an update about 24 hours ago
🚀 AutoRound (https://github.com/intel/auto-round) now supports GGUF export & custom bit settings!

We're excited to announce that AutoRound now supports:
✅ GGUF format export – for seamless compatibility with popular inference engines.
✅ Custom bit settings – tailor quantization to your needs for optimal performance.
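To see what a bit setting trades off, here's a toy simulation of symmetric round-to-nearest quantization at different bit widths. This is only an illustration of the bits-vs-error trade-off, not AutoRound's actual algorithm (which learns the rounding):

```python
def fake_quantize(weights, bits=4):
    """Snap each weight to the nearest point on a symmetric grid with
    2**(bits-1) - 1 levels per side, then map back to float."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

w = [1.0, -0.5, 0.25]
q4 = fake_quantize(w, bits=4)   # fine grid, small error
q2 = fake_quantize(w, bits=2)   # coarse grid, much larger error
```

Dropping from 4 to 2 bits shrinks the grid from 15 levels to 3, which is why 2-bit exports usually mix in higher-precision layers (as the q2ks-mixed models above do).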

Check out these newly released models:
🔹Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q4km-AutoRound
🔹Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound
🔹Intel/Kimi-K2-Instruct-gguf-q2ks-mixed-AutoRound

Stay tuned! An even more advanced algorithm for some configurations is coming soon.
YerbaPage 
posted an update 2 days ago
How to achieve a 100% pass rate on HumanEval? 🔥

Tired of LLMs failing on complex bugs? 🤔 Meet MGDebugger, which just hit 100% accuracy on HumanEval using the DeepSeek-R1 model. 🚀

✨ Demo: learnmlf/MGDebugger
📝 Paper: From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging (2410.01215)
💻 Code: https://github.com/YerbaPage/MGDebugger

HumanEval may be retired, but we're ready for the next challenge in more complex scenarios! You can also take a look at this repo for a collection of awesome repo-level coding tasks:

🖥️ https://github.com/YerbaPage/Awesome-Repo-Level-Code-Generation
dhruv3006 
posted an update 2 days ago