---
license: apache-2.0
tags:
- cvpr25
- JarvisIR
- weights
description: |
  This repository contains the official weights for the CVPR 2025 paper "JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration".
---

# JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

## Model Description

JarvisIR is a novel system that leverages a Vision-Language Model (VLM) to intelligently restore images for autonomous driving perception in adverse weather. It acts as a central controller, dynamically coordinating multiple expert restoration models to tackle complex degradations such as rain, fog, low-light, and snow.

## Key Features

- **VLM Controller**: The first framework to employ a Vision-Language Model for orchestrating image restoration workflows.
- **Multi-Expert Coordination**: Dynamically schedules specialized restoration models for tasks like denoising, super-resolution, and deraining.
- **Adaptive Restoration**: Effectively handles a wide range of adverse weather conditions, including night/low-light, rain, fog, and snow.
- **Advanced Training Strategy**: Utilizes a two-stage process of Supervised Fine-Tuning (SFT) followed by alignment with Mixed-Rank Reward-based Human Feedback (MRRHF).
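
As a rough illustration of this two-stage strategy, the sketch below pairs a standard SFT objective with a generic reward-ranked alignment step. All names (`controller`, `reward_model`, `sft_step`, `alignment_step`) are illustrative placeholders, and the margin-ranking loss is only a stand-in for the paper's MRRHF objective, not its actual formulation.

```python
import torch.nn.functional as F

# Schematic two-stage recipe with placeholder objects; not the paper's exact MRRHF objective.

def sft_step(controller, batch, optimizer):
    """Stage 1: supervised fine-tuning on annotated (image, instruction, plan) examples."""
    logits = controller(batch["inputs"])                 # [batch, seq, vocab] plan tokens
    loss = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def alignment_step(controller, reward_model, batch, optimizer, margin=0.1):
    """Stage 2 (schematic): rank candidate restoration plans by IQA-based reward and
    nudge the controller toward higher-reward candidates with a pairwise margin loss."""
    scores = controller.score_candidates(batch["candidates"])    # [batch, n_candidates]
    rewards = reward_model(batch["restored_images"])              # [batch, n_candidates]
    best = rewards.argmax(dim=1, keepdim=True)
    worst = rewards.argmin(dim=1, keepdim=True)
    loss = F.relu(margin - (scores.gather(1, best) - scores.gather(1, worst))).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```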

## Model Architecture

The system comprises three core components:

1. **VLM Controller**: A LLaVA-v1.5-7B model serves as the core for task planning and expert model selection.
2. **Expert Models**: A suite of specialized networks, each tailored for a specific restoration task (e.g., deraining, defogging).
3. **Reward Models**: A set of Image Quality Assessment (IQA) models that provide feedback for quality assessment and alignment during training.
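
A minimal sketch of how these three components could interact at inference time is shown below; the names (`EXPERTS`, `propose_plan`, `iqa_score`, `restore`) are illustrative placeholders, not this repository's actual API.

```python
from PIL import Image

# Illustrative inference loop with placeholder stand-ins (not this repository's API):
# the VLM controller proposes an ordered plan of expert tools, each expert is applied
# in turn, and an IQA reward model scores the restored result.

EXPERTS = {                        # stand-ins for the specialized expert networks
    "derain":   lambda img: img,
    "defog":    lambda img: img,
    "lowlight": lambda img: img,
    "denoise":  lambda img: img,
}

def propose_plan(image):
    """Stand-in for the LLaVA-based controller: return an ordered list of expert names."""
    return ["lowlight", "denoise"]             # e.g. for a noisy night-time frame

def iqa_score(image):
    """Stand-in for the IQA reward models used to judge restoration quality."""
    return 0.0

def restore(image):
    plan = propose_plan(image)                 # 1. controller selects and orders experts
    for tool in plan:                          # 2. experts run in the planned order
        image = EXPERTS[tool](image)
    return image, iqa_score(image)             # 3. reward model estimates output quality

restored, score = restore(Image.new("RGB", (640, 480)))
```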

## Training Data

JarvisIR was trained on a large-scale, comprehensive dataset:

- **CleanBench-Synthetic**: A dataset of 150,000 synthetically degraded images with corresponding annotations.
- **CleanBench-Real**: A collection of 80,000 real-world images captured in adverse weather, used for alignment training.
- **Comprehensive Coverage**: The data covers four primary weather scenarios (night, rain, fog, snow) with various combinations of degradations.

## Performance

- Achieves a **50% average improvement** in perception metrics on the CleanBench-Real dataset compared to state-of-the-art all-in-one methods.
- Demonstrates superior performance across all tested weather conditions.
- Exhibits enhanced robustness and generalization capabilities in real-world driving scenarios.

## Intended Use

**Primary Use Cases:**

- Enhancing perception systems in autonomous vehicles.
- Building robust, multi-weather image restoration pipelines.
- Advancing research into the applications of Vision-Language Models in image processing.

## Model Checkpoints

This repository provides the following model weights:

- `pretrained`: The complete model after both the Supervised Fine-Tuning and MRRHF alignment stages.
- `agent-tools/`: The weights for each individual expert restoration model.
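
As a usage sketch, the weights can be fetched with `huggingface_hub`; the repository id and folder patterns below are placeholders, not verified paths.

```python
from huggingface_hub import snapshot_download

# Minimal download sketch. The repo id and folder patterns are placeholders;
# replace them with this repository's actual id and layout.
local_dir = snapshot_download(
    repo_id="<org-or-user>/JarvisIR",                     # placeholder repo id
    allow_patterns=["pretrained/*", "agent-tools/*"],     # assumed folder names
)
print("Weights downloaded to:", local_dir)
```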

## Citation

If you find JarvisIR useful in your research, please cite our paper:

```bibtex
@inproceedings{lin2025jarvisir,
  title={Jarvisir: Elevating autonomous driving perception with intelligent image restoration},
  author={Lin, Yunlong and Lin, Zixu and Chen, Haoyu and Pan, Panwang and Li, Chenxin and Chen, Sixiang and Wen, Kairun and Jin, Yeying and Li, Wenbo and Ding, Xinghao},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={22369--22380},
  year={2025}
}
```

## Related Resources

- **Project Page**: https://cvpr2025-jarvisir.github.io/
- **Code Repository**: https://github.com/LYL1015/JarvisIR
- **Paper**: https://arxiv.org/pdf/2504.04158

## Acknowledgments

This work contributes to the advancement of intelligent image restoration by integrating Vision-Language Models with expert system coordination.