---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---
# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
<div align="center">
**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)**
[**Paper (arXiv)**](https://arxiv.org/abs/2507.20388)
</div>
### Abstract
*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*
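To give an intuition for how cross-modal attention can fuse an RGB stream with auxiliary-modality features, here is a minimal, hypothetical PyTorch sketch: queries come from the RGB tokens and attend over keys/values projected from an auxiliary-modality stream. The class name, tensor shapes, and fusion strategy are illustrative assumptions, not the paper's exact CM-MSA (which integrates nine modalities and builds hybrid attention maps); see the GitHub repository for the real implementation.
```python
# Hypothetical sketch of cross-modal multi-headed self-attention.
# Names, shapes, and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossModalMSA(nn.Module):
    """Toy cross-modal attention: RGB queries over auxiliary-modality keys/values."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_rgb = nn.Linear(dim, dim, bias=False)       # queries from the RGB stream
        self.kv_aux = nn.Linear(dim, 2 * dim, bias=False)  # keys/values from auxiliary modalities
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_tokens: torch.Tensor, aux_tokens: torch.Tensor) -> torch.Tensor:
        B, N, C = rgb_tokens.shape
        q = self.q_rgb(rgb_tokens).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_aux(aux_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # "Hybrid" attention map: RGB content scored against multimodal context.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Toy usage: two streams of 64 tokens with dimension 32.
rgb, aux = torch.randn(2, 64, 32), torch.randn(2, 64, 32)
print(CrossModalMSA(dim=32)(rgb, aux).shape)  # torch.Size([2, 64, 32])
```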
## 🆕 Updates
- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! The arXiv paper and a Hugging Face demo are coming soon, stay tuned!
## ⚙️ Setup and Testing
For best results, use a Linux machine with CUDA-capable GPUs.
To set up the environment, first run the provided setup script:
```bash
./environment_setup.sh
# or
bash environment_setup.sh
```
Note: if you run into permission issues, make sure `environment_setup.sh` is executable:
```bash
chmod +x environment_setup.sh
```
The setup takes a couple of minutes to complete.
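Once the environment is ready, inference follows the standard PyTorch pattern sketched below. This is only a hypothetical outline: the image file names are placeholders, and `nn.Identity` stands in for the actual network so the snippet runs end to end; the real model definition, pre-trained weights, and test scripts are in the repository linked below.
```python
# Hypothetical inference sketch; swap the Identity stand-in for the real
# ModalFormer network and checkpoint from the GitHub repository.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a low-light image as a (1, 3, H, W) float tensor in [0, 1].
img = TF.to_tensor(Image.open("low_light.png").convert("RGB")).unsqueeze(0).to(device)

# Stand-in network (placeholder): replace with the actual ModalFormer model.
model = torch.nn.Identity().to(device).eval()

with torch.no_grad():
    enhanced = model(img).clamp(0, 1)

TF.to_pil_image(enhanced.squeeze(0).cpu()).save("enhanced.png")
```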
Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.
## 📚 Citation
```bibtex
@misc{brateanu2025modalformer,
title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
year={2025},
eprint={2507.20388},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.20388},
}
```
## 🙏 Acknowledgements
We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.
Paper: https://arxiv.org/pdf/2507.20388