Update README.md
README.md
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: image-feature-extraction
---

## Description:

PS3-1.5K-SigLIP is a vision encoder that extracts visual features from images of up to 1.5K resolution.

This model is for research and development only.

### License/Terms of Use:

NVIDIA license (see https://huggingface.co/nvidia/PS3-1.5K-SigLIP/blob/main/LICENSE.md)

### Deployment Geography:

Global

### Use Case:

The model is used for extracting visual features from high-resolution images.

### Release Date:

Huggingface [05/30/2025] via [https://huggingface.co/nvidia/PS3-1.5K-SigLIP] <br>

## Reference(s):

The model is from the paper [Scaling Vision Pre-Training to 4K Resolution](https://arxiv.org/abs/2503.19903). Useful links:

[Website](https://nvlabs.github.io/PS3/)
[Paper](https://arxiv.org/abs/2503.19903)
[Models (HF collection)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
[Code](https://github.com/NVlabs/PS3)

## Model Architecture:

**Architecture Type:** Neural Network <br>

**Network Architecture:** Vision Transformer designed for high-resolution images <br>

This model was developed based on [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384). Please see the paper for the training design.

## Input:

**Input Type(s):** Image <br>
**Input Format:** Red, Green, Blue (RGB) <br>
**Input Parameters:** 2D <br>
**Other Properties Related to Input:** Image resolutions up to 1512 * 1512. <br>

## Output:

**Output Type(s):** Embeddings <br>
**Output Format:** Tensor <br>
**Output Parameters:** 1D <br>
**Other Properties Related to Output:** Downstream model required to leverage image features <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:

**Runtime Engine(s):** N/A <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Jetson <br>
NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
Linux 4 Tegra <br>
QNX <br>
Windows <br>

## Model Version(s):

v1.0 - Initial release

## Pre-Trained Models

### PS3 models

| Model | Max Resolution | Hugging Face Link |
|-------|----------------|-------------------|
| PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) |
| PS3-4K-SigLIP | 3780 * 3780 | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) |

## Training Datasets:

75M images <br>

One dataset, built on top of:
- SA-1B (https://ai.meta.com/datasets/segment-anything/)
- IDL (https://huggingface.co/datasets/pixparse/idl-wds)

Training: 100% <br>

## Training Dataset:

**Link:** <br>
We used the following datasets when developing PS3:
- SA-1B (https://ai.meta.com/datasets/segment-anything/)
- IDL (https://huggingface.co/datasets/pixparse/idl-wds)

**Data Collection Method by dataset:** <br>
Automated

**Labeling Method by dataset:** <br>
Automated

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
75M images with resolution up to 4K x 4K.

## Performance

See Table 1 in the paper for full results.

| SigLIP + S2 | | 3780 | 18225 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| **PS3-4K-SigLIP** | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) | 3780 | 3840 | 69.8 | 70.9 | 79.1 | 40.5 | 543 | 67.8 | 64.7 | 63.9 |

## Inference:

**Acceleration Engine:** N/A <br>
**Test Hardware:** <br>
The model was tested on an NVIDIA A100 GPU.

## Installation

Install through pip to use PS3 out of the box.
```bash
pip install ps3-torch
```

If you would like to make changes to the PS3 code, go to the [PS3 repository](https://github.com/NVlabs/PS3), clone the repo, and install it in editable mode:
```bash
cd PS3
pip install -e .
```

## Inference - Quick Start

Here we show example usage including
- loading the model
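
The loading step itself (Section 1 of the Quick Start) is elided in this excerpt. Below is a minimal sketch of what it looks like; the `ps3` import path and the `PS3ImageProcessor` name are assumptions rather than the confirmed API, while `PS3VisionModel` and the `processor(image)` call are taken from the original text:

```python
# Loading sketch -- the `ps3` import path and `PS3ImageProcessor` are assumptions.
from PIL import Image
from ps3 import PS3VisionModel, PS3ImageProcessor  # hypothetical import path

vision_model = PS3VisionModel.from_pretrained("nvidia/PS3-1.5K-SigLIP").cuda().eval()
processor = PS3ImageProcessor.from_pretrained("nvidia/PS3-1.5K-SigLIP")

image = Image.open("example.jpg")  # placeholder image path
x = processor(image)["pixel_values"][0].unsqueeze(0).cuda()  # input used by the snippets below
```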

### 2. Encode High-Res Image with Bottom-Up Selection

PS3 can select important high-res patches based on visual saliency and encode those patches.

**You can encode the whole high-res image using PS3.**
```python
outs = vision_model(x, num_look_close=2)
features = outs.last_hidden_state
print(features.shape) # (1, 5849, 1152)
```

In this example, it runs the high-res selection and encoding only twice.

Note that PS3 processes at most 2560 high-res patches at a time, so running high-res selection and encoding twice gives us 2560 * 2 = 5120 high-res tokens. There are also 729 low-res tokens, which gives 729 + 5120 = 5849 tokens in total.

**You can also decide how many high-res tokens to process by setting `num_token_look_close`.**
```python
outs = vision_model(x, num_token_look_close=3000)
features = outs.last_hidden_state
print(features.shape) # (1, 3729, 1152)
```

In this example, it processes only 3000 high-res tokens. Note that PS3 processes at most 2560 high-res patches at a time, so it needs to run the high-res selection and encoding twice: the first pass processes 2560 high-res tokens and the second processes the remaining 440. In the end it outputs 3729 tokens (3000 high-res + 729 low-res).
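
The same bookkeeping applies to any token budget; a quick sanity check of the numbers above:

```python
# Sanity-check the token arithmetic (PS3 selects at most 2560 patches per round).
import math

num_token_look_close = 3000
rounds = math.ceil(num_token_look_close / 2560)  # 2 rounds: 2560 tokens, then 440
total_tokens = num_token_look_close + 729        # high-res tokens + 729 low-res tokens
print(rounds, total_tokens)                      # 2 3729
```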

**Visualize the bottom-up patch selection probabilities.**
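
The original visualization code is elided in this excerpt. A rough sketch of what such a visualization could look like, assuming the model output exposes per-scale selection probability maps under an attribute like `selection_probs` (the attribute name and map layout are assumptions, not the confirmed API):

```python
# Sketch: plot one heatmap of selection probabilities per high-res scale.
import matplotlib.pyplot as plt

outs = vision_model(x, num_look_close=2)
probs = outs.selection_probs  # hypothetical attribute: list of (B, h, w) maps
for i, prob in enumerate(probs):
    plt.subplot(1, len(probs), i + 1)
    plt.imshow(prob[0].detach().cpu().numpy(), cmap="viridis")
    plt.title(f"scale {i}")
    plt.axis("off")
plt.savefig("selection_probs.png")
```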

## Inference Instructions

[Quick Start](#inference---quick-start) gives some examples of how to use PS3 to encode an image. Below are more detailed explanations of the model inference arguments.

`num_look_close`: how many times to run high-res selection and encoding. PS3 selects and processes 2560 patches each time. If set to `all`, PS3 selects all the high-res patches. If set to `0`, PS3 only returns the low-res features. If set to a larger number than is needed to encode all the high-res patches, PS3 clamps it to the maximum needed.
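
For example, the two boundary settings described above (a sketch reusing `vision_model` and `x` from the Quick Start):

```python
# Boundary settings for num_look_close.
outs_low = vision_model(x, num_look_close=0)      # low-res features only
outs_all = vision_model(x, num_look_close="all")  # select and encode every high-res patch
```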

`num_token_look_close`: (optional) how many high-res patches to select and process. Similar to `num_look_close`, but `num_token_look_close` directly specifies the number of high-res tokens rather than the number of selection-and-encoding runs.

`prompt`: (optional) the prompt embedding used to select high-res patches. The prompt embedding can be the embedding of some text, or an embedding output by an LLM (see the paper). Its shape is (B, C), where B is the batch size (the same as in `pixel_values`) and C is the embedding dimension (the same as the PS3 token embedding dimension). If `prompt=None`, PS3 selects high-res patches based on visual saliency (bottom-up selection).
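
A top-down selection sketch; the random tensor below merely stands in for a real text or LLM embedding, and the dimension 1152 is taken from the feature shapes printed in the Quick Start:

```python
# Top-down selection driven by a prompt embedding of shape (B, C).
import torch

prompt = torch.randn(1, 1152).cuda()  # stand-in for a text/LLM embedding
outs = vision_model(x, num_look_close=2, prompt=prompt)
```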

`gt_selection_maps`: (optional) the ground-truth selection maps for the image: a tensor of 0/1 values with shape (B, h, w), where regions with value 1 should be selected. When selecting high-res patches, PS3 interpolates `gt_selection_maps` to the size of the feature map at each scale, prioritizes the tokens where the value is 1, and, if there is still budget to select more tokens, selects the rest based on the original selection probability.
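
A sketch of steering selection toward a known region; the 27 * 27 map size here is arbitrary, since PS3 interpolates the maps to each scale's feature-map size:

```python
# Prioritize a hand-picked region via a 0/1 ground-truth selection map.
import torch

gt = torch.zeros(1, 27, 27).cuda()
gt[:, 10:20, 10:20] = 1  # this region is selected first
outs = vision_model(x, num_look_close=2, gt_selection_maps=gt)
```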

`smooth_selection_prob`: (optional) smooth the selection probability map so that the selected patches are not scattered too sparsely in each round of high-res selection. It occasionally improves performance slightly when selecting all the patches, but usually hurts when selecting only part of them.

`only_select_first_n_scale`: (optional) only select the first n high-res scales. For example, for the PS3-4K model, `only_select_first_n_scale=2` selects and processes only the 756 and 1512 scales and ignores the 3780 scale.
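
For instance (a sketch, again reusing `vision_model` and `x`):

```python
# Restrict a PS3-4K model to its two lower high-res scales (756 and 1512).
outs = vision_model(x, num_look_close="all", only_select_first_n_scale=2)
```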

`is_global_text`: (optional) only return the pooled low-res features. *It is only used during pre-training.*

`pool_gt_token_only`: (optional) only pool the tokens inside the gt selection regions. *It is only used during pre-training.*

### Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## More Details

Please refer to the [PS3 codebase](https://github.com/NVlabs/PS3) for more details.

## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@article{shi2025scaling,
  title={Scaling Vision Pre-Training to 4K Resolution},
  author={Shi, Baifeng and Li, Boyi and Cai, Han and Lu, Yao and Liu, Sifei and Pavone, Marco and Kautz, Jan and Han, Song and Darrell, Trevor and Molchanov, Pavlo and Yin, Hongxu},
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
```