---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---

# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

<div align="center">
  
**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)**

[![arXiv](https://img.shields.io/badge/arxiv-paper-179bd3)](https://arxiv.org/abs/2507.20388)
</div>

### Abstract
*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*
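
To make the cross-modal attention idea concrete, here is a minimal PyTorch sketch in the same spirit: RGB tokens form the queries, and auxiliary-modality tokens form the keys and values. The class name, tensor shapes, linear projections, and residual fusion below are illustrative assumptions for exposition, not the paper's actual CM-MSA design.

```python
# Minimal cross-modal attention sketch (illustrative only).
# Shapes, projections, and the residual fusion are assumptions --
# see the paper/repository for the real CM-MSA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """RGB tokens query a pool of auxiliary-modality tokens."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)       # queries from the RGB stream
        self.kv = nn.Linear(dim, 2 * dim)  # keys/values from modality features
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # rgb: (B, N, C) RGB token features
        # aux: (B, M, C) modality tokens (deep embeddings, segmentation,
        #      geometric cues, color information, ...)
        B, N, C = rgb.shape
        h, d = self.num_heads, C // self.num_heads
        q = self.q(rgb).view(B, N, h, d).transpose(1, 2)   # (B, h, N, d)
        k, v = self.kv(aux).chunk(2, dim=-1)
        k = k.view(B, -1, h, d).transpose(1, 2)            # (B, h, M, d)
        v = v.view(B, -1, h, d).transpose(1, 2)            # (B, h, M, d)
        # Hybrid attention map: RGB queries attend over modality tokens.
        out = F.scaled_dot_product_attention(q, k, v)      # (B, h, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)
        return rgb + self.proj(out)  # residual fusion with the RGB stream

# Smoke test: 256 RGB tokens attending over 9 modality tokens.
fuse = CrossModalAttention(dim=64)
y = fuse(torch.randn(2, 256, 64), torch.randn(2, 9, 64))  # -> (2, 256, 64)
```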

## 🆕 Updates
- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! arXiv paper version and HuggingFace demo coming soon, stay tuned!

## ⚙️ Setup and Testing
For the smoothest experience, use a Linux machine with CUDA-capable GPUs.

To set up the environment, first run the provided setup script:

```bash
./environment_setup.sh
# or 
bash environment_setup.sh
```

Note: if you run into permission issues, ensure `environment_setup.sh` is executable by running:

```bash
chmod +x environment_setup.sh
```

Allow the setup a couple of minutes to complete.

Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.
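
After setup, single-image inference follows the usual pattern for restoration models. The sketch below is a generic, hedged example: the placeholder network, checkpoint path, and state-dict key are assumptions, so take the actual model constructor and weight layout from the GitHub repository.

```python
# Generic single-image inference sketch. The placeholder network,
# checkpoint path, and state-dict key below are assumptions -- consult
# the ModalFormer GitHub repository for the real entry points.
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = torch.nn.Identity()  # placeholder: swap in the ModalFormer model
# ckpt = torch.load("pretrained_weights/LOL_v1.pth", map_location="cpu")
# model.load_state_dict(ckpt.get("params", ckpt))  # key name varies by repo
model.eval()

img = to_tensor(Image.open("low_light.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    out = model(img).clamp(0.0, 1.0)  # enhanced image in [0, 1]
to_pil_image(out.squeeze(0)).save("enhanced.png")
```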

## 📚 Citation

```bibtex
@misc{brateanu2025modalformer,
      title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement}, 
      author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
      year={2025},
      eprint={2507.20388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20388}, 
}
```

## 🙏 Acknowledgements
We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388