AI & ML interests

Computer Vision Technology and Data Collection for Anime Waifu

Recent Activity

prithivMLmods posted an update about 18 hours ago
I've added the demo of the openbmb/MiniCPM-V-4 model to the Hugging Face Space:
prithivMLmods/Multimodal-VLM-Thinking

✨ MiniCPM-V 4.0 is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B, with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with greatly improved efficiency.

✨ With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. This performance surpasses GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B parameters, OpenCompass 65.2), and Qwen2.5-VL-3B-Instruct (3.8B parameters, OpenCompass 64.5). It also shows good performance in multi-image and video understanding.
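
If you'd rather run it locally than through the Space, here is a minimal sketch. It assumes MiniCPM-V 4.0 keeps the trust_remote_code chat() interface of earlier MiniCPM-V releases, so verify the exact usage against the official model card:

    # Hedged sketch: assumes MiniCPM-V 4.0 exposes the same remote-code
    # chat() API as MiniCPM-V 2.6; check the model card for exact usage.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        "openbmb/MiniCPM-V-4", trust_remote_code=True, torch_dtype=torch.bfloat16
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-4", trust_remote_code=True)

    image = Image.open("example.jpg").convert("RGB")  # any local test image
    msgs = [{"role": "user", "content": [image, "Describe this image."]}]
    print(model.chat(msgs=msgs, tokenizer=tokenizer))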

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To learn more, visit the respective model card.
ImranzamanML posted an update 1 day ago
OpenAI has launched GPT-5, a significant leap forward in AI technology that is now available to all users. The new model unifies all of OpenAI's previous developments into a single, cohesive system that automatically adapts its approach based on the complexity of the user's request. This means it can prioritize speed for simple queries or engage a deeper reasoning model for more complex problems, all without the user having to manually switch settings.

Key Features and Improvements
Unified System: GPT-5 combines various models into one interface, intelligently selecting the best approach for each query.

Enhanced Coding: It's being hailed as the "strongest coding model to date," with the ability to create complex, responsive websites and applications from a single prompt.

PhD-level Reasoning: According to CEO Sam Altman, GPT-5 offers a significant jump in reasoning ability, with a much lower hallucination rate. It also performs better on academic and human-evaluated benchmarks.

New Personalities: Users can now select from four preset personalities (Cynic, Robot, Listener, and Nerd) to customize their chat experience.

Advanced Voice Mode: The voice mode has been improved to sound more natural and adapt its speech based on the context of the conversation.
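
For developers, here is a minimal sketch of calling the model through the official openai Python SDK (the prompt is illustrative; see OpenAI's docs for current parameters):

    # Minimal sketch using the official openai SDK; reads OPENAI_API_KEY
    # from the environment. The prompt is illustrative.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Build a responsive landing page in a single HTML file."}],
    )
    print(resp.choices[0].message.content)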


https://openai.com/index/introducing-gpt-5/
https://openai.com/gpt-5/
ImranzamanML posted an update 3 days ago
All the key links to OpenAI's open-sourced GPT-OSS models (117B and 21B parameters), released under Apache 2.0. Here is a quick guide to exploring and building with them:

Intro & vision: https://openai.com/index/introducing-gpt-oss

Model specs & license: https://openai.com/index/gpt-oss-model-card/

Dev overview: https://cookbook.openai.com/topic/gpt-oss

How to run via vLLM: https://cookbook.openai.com/articles/gpt-oss/run-vllm

Harmony I/O format: https://github.com/openai/harmony

Reference PyTorch code: https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation

Community site: https://gpt-oss.com/

Let's dive deep into the OpenAI models now 😊
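
As a quick start for the vLLM route above, a hedged sketch of serving the 20B model and querying its OpenAI-compatible endpoint (the port and served model name are assumptions; the cookbook article has the authoritative invocation):

    # First serve the model (shell):  vllm serve openai/gpt-oss-20b
    # Then query the OpenAI-compatible endpoint it exposes (default port 8000).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": "Explain the harmony response format in one sentence."}],
    )
    print(resp.choices[0].message.content)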

#OpenSource #AI #GPTOSS #OpenAI #LLM #Python #GenAI
ImranzamanML posted an update 4 days ago
Finally, OpenAI is sharing open-source models again, for the first time since GPT-2 in 2019.
gpt-oss-120b
gpt-oss-20b

openai/gpt-oss-120b

#AI #GPT #LLM #Openai
prithivMLmods posted an update 4 days ago
Qwen Image – The Latest Image Generation Model🔥

Below are some samples generated using the Qwen Image Diffusion Model. Qwen-Image, a 20B MMDiT model for next-generation text-to-image generation, preserves typographic details, layout coherence, and contextual harmony with stunning accuracy. It is especially strong at creating graphic posters with native text. The model is now open-source. [ 𝚀𝚠𝚎𝚗-𝙸𝚖𝚊𝚐𝚎 : Qwen/Qwen-Image ]

⤷ Try the Qwen Image demo here: prithivMLmods/Qwen-Image-Diffusion

⤷ Qwen-Image Technical Report : Qwen-Image Technical Report (2508.02324)
⤷ Qwen Image [GitHub] : https://github.com/QwenLM/Qwen-Image

Even more impressively, it demonstrates a strong ability to understand images. The model supports a wide range of vision-related tasks such as object detection, semantic segmentation, depth and edge (Canny) estimation, novel view synthesis, and image super-resolution. While each task is technically distinct, they can all be viewed as advanced forms of intelligent image editing driven by deep visual understanding. Collectively, these capabilities position Qwen-Image as more than just a tool for generating appealing visuals; it serves as a versatile foundation model for intelligent visual creation and transformation, seamlessly blending language, layout, and imagery.

Qwen-Image uses a dual-stream MMDiT architecture with a frozen Qwen2.5-VL, VAE encoder, RMSNorm for QK-Norm, LayerNorm elsewhere, and a custom MSRoPE scheme for joint image-text positional encoding.
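
A minimal text-to-image sketch, assuming Qwen-Image loads through the standard diffusers pipeline API (the step count and prompt are illustrative):

    # Hedged sketch: assumes Qwen/Qwen-Image is supported by DiffusionPipeline.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    image = pipe(
        prompt='A flat-design graphic poster with the native text "Qwen-Image"',
        num_inference_steps=50,
    ).images[0]
    image.save("poster.png")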

To learn more, visit the respective model card.
Tonic posted an update 7 days ago
prithivMLmods posted an update 7 days ago
Introducing Camel-Doc-OCR-080125 (v2), a document content-structure retrieval VLM designed for content extraction and summarization. This is the second model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825 (v1). The new version fixes formal table reconstruction issues in both English and Chinese, achieving optimal performance for long-context inference. 🤗🐪

⤷ Camel-Doc-OCR(v2) : prithivMLmods/Camel-Doc-OCR-080125
⤷ Camel-Doc-OCR(v1) : prithivMLmods/Camel-Doc-OCR-062825
⤷ Demo : prithivMLmods/core-OCR
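
A hedged inference sketch, assuming the model loads through the standard Qwen2.5-VL classes it is fine-tuned from and a recent transformers release with multimodal chat templates (the prompt and file name are illustrative):

    # Hedged sketch: assumes Camel-Doc-OCR-080125 uses the stock Qwen2.5-VL
    # architecture and processor; requires a recent transformers release.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "prithivMLmods/Camel-Doc-OCR-080125"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{"role": "user", "content": [
        {"type": "image", "image": "page.png"},  # illustrative document scan
        {"type": "text", "text": "Reconstruct the table on this page as Markdown."},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)[0])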

Multimodal Model Collections and Spaces:

➝ Camel-Doc-OCR : prithivMLmods/camel-doc-ocr-080125-688c0c61c5dba648756f31f8
➝ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
➝ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➝ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

To learn more, visit the respective model card.
ImranzamanML posted an update 7 days ago
How Transformer model layers work!

I focused on showing the core steps side by side: tokenization, embedding, and the transformer layers themselves, highlighting the self-attention and feedforward parts of each layer without getting lost in too much technical depth.

It shows how these layers work together to understand context and generate meaningful output!
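
To make the flow concrete, here is a self-contained PyTorch sketch of one such layer: embedded tokens pass through self-attention and then a position-wise feedforward, each wrapped in a residual connection and layer norm (the dimensions are illustrative):

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)   # self-attention: every token attends to every other
            x = self.norm1(x + attn_out)       # residual connection + layer norm
            x = self.norm2(x + self.ff(x))     # position-wise feedforward, same wrapping
            return x

    tokens = torch.randn(1, 16, 512)           # stand-in for a batch of embedded tokens
    print(TransformerBlock()(tokens).shape)    # torch.Size([1, 16, 512])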

If you are curious about the architecture behind AI language models or want a clean way to explain it, hit me up; I'd love to share!



#AI #MachineLearning #NLP #Transformers #DeepLearning #DataScience #LLM #AIAgents
prithivMLmods posted an update 9 days ago
Excited to bring the explicitly grounded experimental reasoning model Lumian-VLR-7B-Thinking, built on top of Qwen2.5-VL and featuring reasoning-aware trajectories with enhanced spatial perception. Along with this, we've also added a demo for the model, while bringing in some of the latest and most interesting models on the Hub to make full use of the remaining resources.

✨ Multimodal-VLM-Thinking : prithivMLmods/Multimodal-VLM-Thinking
✨ Multimodal-VLM-OCR : prithivMLmods/Multimodal-VLM-OCR
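
To call the demo programmatically, a hedged sketch with gradio_client; the endpoint name and argument order are assumptions, so check the Space's "Use via API" panel for the real signature:

    # Hedged sketch: api_name and inputs are assumptions; consult the
    # Space's "Use via API" panel for the actual endpoint signature.
    from gradio_client import Client, handle_file

    client = Client("prithivMLmods/Multimodal-VLM-Thinking")
    result = client.predict(
        handle_file("chart.png"),              # illustrative input image
        "Describe the chart step by step.",    # illustrative prompt
        api_name="/predict",                   # assumed endpoint name
    )
    print(result)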

✦ Models used in these spaces:

✨ Lumian-VLR-7B-Thinking : prithivMLmods/Lumian-VLR-7B-Thinking
✨ Enesidaon-VLR-7B-no-Thinking : prithivMLmods/Enesidaon-VLR-7B-no-Thinking
✨ GLM-4.1V-9B-Thinking : zai-org/GLM-4.1V-9B-Thinking
✨ DREX-062225-exp : prithivMLmods/DREX-062225-exp & more ...

✦ Multimodal Model Collections and Spaces:

✨ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
✨ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

To learn more, visit the respective model card.