Abstract
Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1 compared to 62.3 of SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows better gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model which attains 80.3% on olmOCR and 88.64 on OmniDocBench.
Community
Open vocab segmentation is here with Falcon Perception
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams (2026)
- UniCompress: Token Compression for Unified Vision-Language Understanding and Generation (2026)
- Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders (2026)
- DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving (2026)
- Kelix Technical Report (2026)
- EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation (2026)
- Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2603.27365 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 2
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper