---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- keypoint-matching
library_name: transformers
license: apache-2.0
pipeline_tag: keypoint-detection
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.

This is a LightGlue variant trained with DISK features, released under a commercially permissive license. It requires `kornia` to be installed and can be used with `transformers` with the following lines of code:
```python
from transformers import LightGlueForKeypointMatching

model = LightGlueForKeypointMatching.from_pretrained("ETH-CVG/lightglue_disk", trust_remote_code=True)
```

_Note: the commit enabling DISK to work with LightGlue is not yet included in a released version of `transformers`; please install `transformers` from the main branch:_
```bash
uv pip install git+https://github.com/huggingface/transformers.git
```

# LightGlue

The LightGlue model was proposed
in [LightGlue: Local Feature Matching at Light Speed](http://arxiv.org/abs/2306.13643) by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.

This model matches two sets of interest points detected across a pair of images. Paired with a keypoint detector such as DISK (as in this checkpoint) or 
[SuperPoint](https://huggingface.co/magic-leap-community/superpoint), it can be used to match two images and 
estimate the pose between them. This model is useful for tasks such as image matching and homography estimation.

The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple 
design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. 
Cumulatively, they make LightGlue more efficient (in terms of both memory and computation), more accurate, and much 
easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is 
much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or 
limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive 
applications like 3D reconstruction. The code and trained models are publicly available at [github.com/cvg/LightGlue](https://github.com/cvg/LightGlue).


<img src="https://raw.githubusercontent.com/cvg/LightGlue/main/assets/easy_hard.jpg" alt="drawing" width="800"/>

This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille).
The original code can be found [here](https://github.com/cvg/LightGlue).

## Demo notebook

A demo notebook showcasing inference + visualization with LightGlue can be found [TBD]().


## Model Details

### Model Description

LightGlue is a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points.
Building on the success of SuperGlue, this model can introspect the confidence of its own predictions and adapts the amount of
computation to the difficulty of each image pair. Both its depth and width are adaptive (see the illustrative sketch below):
1. inference can stop at an early layer if all predictions are already confident;
2. points deemed non-matchable are discarded early from further steps.

As a result, LightGlue is faster, more accurate, and easier to train than the long-unrivaled SuperGlue.

<img src="https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/ILpGyHuWwK2M9Bz0LmZLh.png" alt="drawing" width="1000"/>
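For intuition, here is an illustrative pseudocode sketch of the adaptive inference loop described above. All names (`layers`, `confidence_head`, `exit_threshold`, `prune_threshold`) are hypothetical and do not correspond to the actual `transformers` implementation; the real exit and pruning criteria are more elaborate.

```python
# Hypothetical sketch of LightGlue-style adaptive inference, not the real API.
import torch

def adaptive_match(desc0, desc1, layers, confidence_head,
                   exit_threshold=0.95, prune_threshold=0.05):
    """Refine two descriptor sets layer by layer, pruning points and exiting early."""
    for layer in layers:
        # Each layer jointly refines both descriptor sets (self- and cross-attention).
        desc0, desc1 = layer(desc0, desc1)

        # Width adaptivity: discard points the network deems non-matchable.
        conf0 = confidence_head(desc0)  # (N0,) matchability confidences
        conf1 = confidence_head(desc1)  # (N1,)
        keep0, keep1 = conf0 > prune_threshold, conf1 > prune_threshold
        desc0, desc1 = desc0[keep0], desc1[keep1]

        # Depth adaptivity: stop once all surviving predictions are confident.
        if keep0.any() and keep1.any() \
                and conf0[keep0].min() > exit_threshold \
                and conf1[keep1].min() > exit_threshold:
            break
    return desc0, desc1
```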

- **Developed by:** ETH Zurich - Computer Vision and Geometry Lab 
- **Model type:** Image Matching
- **License:** Apache 2.0

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/cvg/LightGlue
- **Paper:** http://arxiv.org/abs/2306.13643
- **Demo:** https://colab.research.google.com/github/cvg/LightGlue/blob/main/demo.ipynb

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

LightGlue is designed for feature matching and pose estimation tasks in computer vision. It can be applied to a variety of multiple-view 
geometry problems and can handle challenging real-world indoor and outdoor environments. However, it may not perform well on tasks that 
require different types of visual understanding, such as object detection or image classification.

## How to Get Started with the Model

Here is a quick example of using the model. As an image matching model, it requires pairs of images as input. 
The raw outputs contain the keypoints detected by the keypoint detector as well as the matches with their corresponding 
matching scores.
```python
from transformers import AutoImageProcessor, AutoModel
import torch
from PIL import Image
import requests

url_image1 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg"
image1 = Image.open(requests.get(url_image1, stream=True).raw)
url_image2 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg"
image2 = Image.open(requests.get(url_image2, stream=True).raw)

images = [image1, image2]

processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_disk", trust_remote_code=True)
model = AutoModel.from_pretrained("ETH-CVG/lightglue_disk", trust_remote_code=True)

inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```

You can use the `post_process_keypoint_matching` method from the `LightGlueImageProcessor` to get the keypoints and matches in a readable format:
```python
image_sizes = [[(image.height, image.width) for image in images]]
outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
for i, output in enumerate(outputs):
    print("For the image pair", i)
    for keypoint0, keypoint1, matching_score in zip(
            output["keypoints0"], output["keypoints1"], output["matching_scores"]
    ):
        print(
            f"Keypoint at coordinate {keypoint0.numpy()} in the first image matches with keypoint at coordinate {keypoint1.numpy()} in the second image with a score of {matching_score}."
        )
```
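Since matching is typically a stepping stone towards pose or homography estimation, here is a minimal sketch of that downstream step. It is not part of the model's API: it assumes `opencv-python` is installed and reuses the post-processed `outputs` from the snippet above.

```python
import cv2
import numpy as np

# Matched keypoint coordinates for the first (and only) image pair.
output = outputs[0]
points0 = output["keypoints0"].numpy().astype(np.float32)
points1 = output["keypoints1"].numpy().astype(np.float32)

# RANSAC-based homography estimation; requires at least 4 matches.
if len(points0) >= 4:
    H, inlier_mask = cv2.findHomography(points0, points1, cv2.RANSAC, 5.0)
    print(f"Estimated homography:\n{H}")
    print(f"{int(inlier_mask.sum())} of {len(points0)} matches are RANSAC inliers.")
```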

You can visualize the matches between the images by providing the original images as well as the outputs to this method:
```python
processor.plot_keypoint_matching(images, outputs)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/Fj2f2LbFEbfCEaC5Fsz3z.png)

## Training Details

LightGlue is trained on large annotated datasets for pose estimation, enabling it to learn geometric priors and reason about the 3D scene.
The training data consists of image pairs with ground truth correspondences and unmatched keypoints derived from ground truth poses and depth maps.

LightGlue follows the supervised training setup of SuperGlue. It is first pre-trained with synthetic homographies sampled from 1M images. 
Such augmentations provide full and noise-free supervision but require careful tuning. LightGlue is then fine-tuned with the MegaDepth dataset, 
which includes 1M crowd-sourced images depicting 196 tourism landmarks, with camera calibration and poses recovered by SfM and 
dense depth by multi-view stereo.
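For intuition about the homography pre-training stage, here is a minimal sketch of how synthetic-homography supervision can be generated; the corner-perturbation scheme and magnitudes below are hypothetical, not the authors' exact recipe. Because the homography is known, warping keypoints with it yields noise-free ground-truth correspondences.

```python
# Hypothetical sketch of synthetic-homography supervision, not the authors' exact recipe.
import cv2
import numpy as np

def random_homography(strength=0.15):
    """Sample a homography by randomly perturbing the four unit-square corners."""
    src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float32)
    dst = src + np.random.uniform(-strength, strength, src.shape).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def warp_points(points, H):
    """Apply homography H to Nx2 points in normalized image coordinates."""
    points_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    warped = points_h @ H.T
    return warped[:, :2] / warped[:, 2:3]

# Keypoints detected in the source image become ground-truth matches after warping.
keypoints = np.random.rand(512, 2)
gt_correspondences = warp_points(keypoints, random_homography())
```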

#### Training Hyperparameters

- **Training regime:** fp32

#### Speeds, Sizes, Times

LightGlue is designed to be efficient and runs in real-time on a modern GPU. A forward pass takes approximately 44 milliseconds (22 FPS) for an image pair. 
The model has 13.7 million parameters, making it relatively compact compared to some other deep learning models.
The inference speed of LightGlue is suitable for real-time applications and can be readily integrated into 
modern Simultaneous Localization and Mapping (SLAM) or Structure-from-Motion (SfM) systems.
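To get a rough latency number on your own hardware, a simple benchmark along the following lines can be used, reusing `model` and `inputs` from the example above; the figures will of course vary with GPU, image resolution, and number of detected keypoints.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    # Warm-up so kernel compilation and caching do not skew the measurement.
    for _ in range(5):
        model(**inputs)

    if device == "cuda":
        torch.cuda.synchronize()
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()

elapsed = (time.perf_counter() - start) / runs
print(f"Average forward pass: {elapsed * 1000:.1f} ms ({1 / elapsed:.1f} FPS)")
```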

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@inproceedings{lindenberger2023lightglue,
  author    = {Philipp Lindenberger and
               Paul-Edouard Sarlin and
               Marc Pollefeys},
  title     = {{LightGlue: Local Feature Matching at Light Speed}},
  booktitle = {ICCV},
  year      = {2023}
}
```

## Model Card Authors

[Steven Bucaille](https://github.com/sbucaille)