Alan Dao (alandao)
Recent Activity
- New activity (21 days ago): janhq/Jan-v3-4B-base-instruct, Update README.md
- New activity (26 days ago): janhq/Jan-v3-4B-base-instruct-gguf, Update README.md
- New activity (26 days ago): janhq/Jan-v3-4B-base-instruct, Update README.md
reacted to andito's post with 🔥 8 months ago
🧠👁️ Can AI visualize solutions?
Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?
That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
🔧 Mirage is trained in two phases:
1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.
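A rough sketch of that two-phase recipe, in toy Python. The loss shapes and names below are my own illustration, not the paper's actual code; a "latent visual token" is just a small feature vector here:

```python
# Toy illustration of Mirage-style two-phase training (hypothetical, simplified).

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def grounding_loss(pred_latents, image_features, answer_loss):
    # Phase 1 (grounding): each predicted latent token is anchored to the
    # matching feature vector extracted from a real image, on top of the
    # ordinary task/answer loss.
    anchor = sum(mse(p, t) for p, t in zip(pred_latents, image_features))
    return anchor / len(pred_latents) + answer_loss

def refinement_loss(answer_loss):
    # Phase 2 (refinement): the image anchor is dropped, so the model must
    # shape its latent tokens purely to support reasoning.
    return answer_loss

pred = [[0.1, 0.2], [0.4, 0.4]]   # latents proposed by the model
real = [[0.0, 0.2], [0.5, 0.5]]   # features from the ground-truth image
loss = grounding_loss(pred, real, answer_loss=1.0)
```

The point of the staging is that phase 1 keeps the latents visually meaningful, while phase 2 frees them from having to reconstruct anything photorealistic.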
📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.
By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
posted an update 8 months ago
Don’t give up 🔥
Do you know what I was planning to do this time last week?
I was preparing to write a report declaring Jan-nano a failed project because the benchmark results didn’t meet expectations.
But I thought: that can’t be right. When I loaded the model into the app, it clearly performed better. So why were the benchmark numbers worse?
That’s when I reviewed the entire benchmark codebase and realized something fundamental: agentic and workflow-based approaches produce very different scores, with high variance between runs. Jan-nano was trained in an agentic setup, so it simply can’t be evaluated with a rigid, workflow-based harness.
I made the necessary changes, and the model ended up performing even better than before. It turns out the previous benchmarking method conflicted with how the model was trained.
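The mismatch can be sketched with a toy harness. Everything below is hypothetical illustration, not Jan-nano's actual evaluation code: the rigid harness forces one retrieval step and an immediate answer, while the agentic harness lets the model drive its own tool loop.

```python
class ToyAgent:
    """Stand-in for an agentic model: it decides its own next action."""
    def __init__(self):
        self.searches = 0

    def decide(self, question, context):
        # Toy policy: search twice, then answer.
        if self.searches < 2:
            self.searches += 1
            return ("search", f"{question} (refinement {self.searches})")
        return ("answer", f"answer using {len(context)} tool results")

def workflow_benchmark(agent, question, tools):
    # Rigid harness: exactly one forced retrieval step, then an immediate
    # answer; an agent trained to keep searching never gets the chance.
    context = [tools["search"](question)]
    return f"answer using {len(context)} tool results"

def agentic_benchmark(agent, question, tools, max_turns=8):
    # Agentic harness: the model chooses when to call tools and when to
    # stop, matching how it was trained to operate.
    context = []
    for _ in range(max_turns):
        action, arg = agent.decide(question, context)
        if action == "answer":
            return arg
        context.append(tools[action](arg))
    return "no answer"

tools = {"search": lambda q: f"results for {q!r}"}
result = agentic_benchmark(ToyAgent(), "capital of France", tools)
```

Under the agentic harness the toy agent answers with two tool results; the workflow harness caps it at one, which is the kind of gap that made the early numbers misleading.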
What if I had given up? That would’ve meant 1.5 months of training and a huge amount of company resources wasted.
But now, this is officially the biggest and most successful release for the whole team, all thanks to Jan-nano.
Menlo/Jan-nano
Great job, guys. Reasoning brings so much potential!
We also had a similar idea, but applied it only to mazes!
reacted to reach-vb's post with 😎 over 1 year ago
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥
> WhisperSpeech X Llama 3.1 8B
> Trained on 50K hours of speech (7 languages)
> Continually trained on 10x A1000s for 45 hrs
> MLS -> WhisperVQ tokens -> Llama 3.1
> Instruction tuned on 1.89M samples
> 70% speech, 20% transcription, 10% text
> Apache 2.0 licensed ⚡
Architecture:
> WhisperSpeech/ VQ for Semantic Tokens
> Llama 3.1 8B Instruct for Text backbone
> Early fusion (Chameleon)
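That early-fusion idea can be sketched in a few lines. The vocabulary and codebook sizes below are illustrative assumptions, not Ichigo's real numbers: discrete audio codes are offset past the text vocabulary so the backbone consumes one unified token stream.

```python
# Toy early-fusion tokenization (Chameleon-style), illustrative only.
TEXT_VOCAB = 128_000      # assumed text vocabulary size
AUDIO_CODEBOOK = 512      # assumed VQ codebook size

def fuse(text_ids, audio_codes):
    """Map audio code k to token id TEXT_VOCAB + k and append, so the
    text backbone sees a single sequence of ordinary token ids."""
    assert all(0 <= k < AUDIO_CODEBOOK for k in audio_codes)
    return text_ids + [TEXT_VOCAB + k for k in audio_codes]

seq = fuse([1, 42], [7, 500])   # [1, 42, 128007, 128500]
```

From the backbone's perspective there is no modality boundary: training is plain next-token prediction over the fused sequence, which is what makes the fusion "early".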
I'm super bullish on Homebrew / Jan, early fusion, and audio-text multimodal models!
(P.S. Play with the demo on Hugging Face: jan-hq/Ichigo-llama3.1-s-instruct)