Building Lectūra AI | CS Grad Student @BIT | AI/ML Research: Autonomous Agents, LLMs | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
Towards batch sizes too small to meter🎉 Beautiful work! And my personal favorite so far - I adore peak performance at small/nano scale. Everyone deserves to run/train AGI locally:) our data, our god model! They showed that:
- You can train LLMs (up to 1B params) with batch sizes as low as batch_size=1. This is unconventional, given that small batch sizes can lead to unstable/spiky training runs.
- You can have a stable training run with just vanilla SGD (stochastic gradient descent), no momentum required🤯
- Small batch sizes are more robust to hyperparameters (i.e., no worries about initialization).
- Smaller batch sizes outperform larger batch sizes (“better per-Flops performance”).
“We recommend that practitioners training large models in memory-constrained settings exploit the benefits of small batch sizes rather than trying to emulate the large batch size setting (e.g., through gradient accumulation) typically used in industry.”
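A minimal sketch of that setting, assuming a generic PyTorch language model and a dataloader built with batch_size=1 (both placeholder names, not from the paper or any repo): plain SGD with momentum=0, and an optimizer step on every batch instead of gradient accumulation.

```python
# Minimal sketch (assumes a generic PyTorch LM and a dataloader with batch_size=1).
# Shows the recommended setting: vanilla SGD with no momentum, and an update on
# every step rather than emulating large batches via gradient accumulation.
import torch
import torch.nn.functional as F

def train_small_batch(model, dataloader, lr=1e-3, max_steps=1000, device="cpu"):
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.0)  # plain SGD
    for step, (x, y) in enumerate(dataloader):   # each batch is a single sequence
        x, y = x.to(device), y.to(device)
        logits = model(x)                        # (1, seq_len, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()                               # update immediately: no gradient accumulation
        if step + 1 >= max_steps:
            break
    return model
```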
I’ve been doing this for ages - my mantra: all my experiments must run on my 8GB RAM M2 before moving to a GPU. IOW, I love being GPU poor. Check out my nano AI algorithms repo: https://github.com/Jaykef/ai-algorithms - all notebooks run on as little as 8GB of RAM.
I played around with the new RXTX paper (fast XX^T) and was able to train nanoGPT with 4x4 RXTX matmuls in both the attention layer and the optimizer🤕 It just works (well, I had to add some guardrails) but still saves 5% of memory usage.
The patch:
- Computes attention scores with 4x4 blockwise RXTX matmuls (no PyTorch dot product).
- Handles arbitrary sequence lengths by padding to the nearest multiple of 4.
- An RXTX variant of Shampoo, with params reshaped into 4x4 blocks during each optimizer step.
- Uses 5% fewer ops.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/nanogpt-rxtx.ipynb
Paper: https://arxiv.org/pdf/2505.09814
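For a feel of the blockwise structure, here's a minimal sketch (not the notebook's code): it pads X to a multiple of 4 rows and assembles X Xᵀ from 4x4 block products. A plain per-block matmul stands in where the paper's RXTX 4x4 kernel would slot in.

```python
# Minimal sketch (not from the notebook): blockwise X @ X.T with rows padded to a
# multiple of 4. The inner matmul is a stand-in for the RXTX 4x4 kernel from the
# paper, which computes each block product with fewer multiplications.
import torch

def blockwise_xxt(X: torch.Tensor, block: int = 4) -> torch.Tensor:
    n, d = X.shape
    pad = (-n) % block                      # pad rows to the nearest multiple of 4
    if pad:
        X = torch.cat([X, X.new_zeros(pad, d)], dim=0)
    m = X.shape[0] // block
    blocks = X.view(m, block, d)            # (m, 4, d) row blocks
    out = X.new_zeros(m * block, m * block)
    for i in range(m):
        for j in range(i + 1):              # XX^T is symmetric: lower triangle only
            prod = blocks[i] @ blocks[j].T  # <-- RXTX kernel would replace this 4x4-block product
            out[i*block:(i+1)*block, j*block:(j+1)*block] = prod
            if i != j:
                out[j*block:(j+1)*block, i*block:(i+1)*block] = prod.T
    return out[:n, :n]                      # drop the padding
```

This is for intuition only, not a drop-in replacement for the notebook's attention or Shampoo patches; it just shows the padding and 4x4 blocking that the patch relies on.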
You can now do edit operations with a discrete flow model, super cool👍! It's amazing to see the progress on DFM within a year of its introduction - literally my litmus test for how fast the field is progressing:
First introduced (2024): https://arxiv.org/abs/2402.04997
Discrete Flow Matching (2024): https://arxiv.org/abs/2407.15595
Edit Discrete Flow (2025): https://arxiv.org/pdf/2506.09018
Looking forward to SaaS-level reach like that of dLLMs, e.g. Mercury by Inception Labs 🚀
Bumped into one of the OG reads today!! Handwriting generation & synthesis is still my favorite application of RNNs - super amazed at how such a small model (3.6M params), trained overnight on CPU, could reach such peak performance. Huge credit to the data (IAM-OnDB🔥), which was meticulously curated using an infrared device to track pen position.
Try the demo here: https://www.calligrapher.ai/
Code: https://github.com/sjvasquez/handwriting-synthesis
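For context on what those few million params are doing, here's a minimal sketch (not the repo's code, dimensions illustrative) of a Graves-style handwriting prediction model: an LSTM over pen offsets with a mixture-density head of bivariate Gaussians over (dx, dy) plus a pen-lift Bernoulli. The synthesis variant additionally conditions on text through a soft attention window, omitted here.

```python
# Minimal sketch (not the repo's code): Graves-style handwriting prediction model.
# An LSTM reads pen offsets and predicts a mixture of bivariate Gaussians over the
# next (dx, dy) plus a Bernoulli "end of stroke" probability.
import torch
import torch.nn as nn

class HandwritingRNN(nn.Module):
    def __init__(self, hidden=256, n_mixtures=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        # Per mixture: weight, mean_x, mean_y, log_std_x, log_std_y, correlation (6 params),
        # plus one logit for pen lift (end of stroke).
        self.head = nn.Linear(hidden, 6 * n_mixtures + 1)

    def forward(self, strokes):                     # strokes: (B, T, 3) = (dx, dy, pen_lift)
        h, _ = self.lstm(strokes)
        out = self.head(h)
        pen_logit = out[..., 0]
        pi, mu_x, mu_y, log_sx, log_sy, rho = out[..., 1:].chunk(6, dim=-1)
        return {
            "pi": torch.softmax(pi, dim=-1),                          # mixture weights
            "mu": torch.stack([mu_x, mu_y], dim=-1),                  # component means
            "sigma": torch.exp(torch.stack([log_sx, log_sy], dim=-1)),# positive std devs
            "rho": torch.tanh(rho),                                   # correlation in (-1, 1)
            "pen": torch.sigmoid(pen_logit),                          # P(end of stroke)
        }
```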