bfshi-nvidia committed on
Commit 61487e6 · verified · 1 Parent(s): a33476b

Update README.md

Files changed (1)
  1. README.md +114 -46
README.md CHANGED
@@ -5,9 +5,32 @@ base_model:
5
  - google/siglip-so400m-patch14-384
6
  pipeline_tag: image-feature-extraction
7
  ---
8
- <div align="center">
9
 
10
- # Scaling Vision Pre-Training to 4K Resolution
11
 
12
  [![website](https://img.shields.io/badge/website-76b900?style=for-the-badge&logo=safari&labelColor=555555)](https://nvlabs.github.io/PS3/)
13
  [![Arxiv](https://img.shields.io/badge/Arxiv-b31b1b?style=for-the-badge&logo=arxiv&labelColor=555555)](https://arxiv.org/abs/2503.19903)
@@ -15,31 +38,50 @@ pipeline_tag: image-feature-extraction
15
  [![PS3 Models](https://img.shields.io/badge/PS3%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
16
  [![VILA-HD Models](https://img.shields.io/badge/VILA--HD%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
17
  [![PS3 Code](https://img.shields.io/badge/PS3%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/PS3)
18
- [![VILA-HD Code](https://img.shields.io/badge/VILA--HD%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/VILA/tree/main/vila_hd)
19
-
20
- <div style="font-family: charter;">
21
- <a href="https://bfshi.github.io" target="_blank" style="color: #6f6f6f; text-decoration: none;">Baifeng Shi</a><sup style="font-size: 0.6em;">1,2</sup>&nbsp;&nbsp;&nbsp;
22
- <a href="https://sites.google.com/site/boyilics/home" target="_blank" style="color: #6f6f6f; text-decoration: none;">Boyi Li</a><sup style="font-size: 0.6em;">1,2</sup>&nbsp;&nbsp;&nbsp;
23
- <a href="https://han-cai.github.io/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Han Cai</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
24
- <a href="https://scholar.google.com/citations?user=OI7zFmwAAAAJ&hl=en/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Yao Lu</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
25
- <a href="https://sifeiliu.net/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Sifei Liu</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
26
- <a href="https://research.nvidia.com/person/marco-pavone" target="blank" style="color: #6f6f6f; text-decoration: none;">Marco Pavone</a><sup style="font-size: 0.6em;">2</sup>
27
- <br>
28
- <a href="https://jankautz.com/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Jan Kautz</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
29
- <a href="https://hanlab.mit.edu/songhan/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Song Han</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
30
- <a href="https://people.eecs.berkeley.edu/~trevor/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Trevor Darrell</a><sup style="font-size: 0.6em;">1</sup>&nbsp;&nbsp;&nbsp;
31
- <a href="https://www.pmolchanov.com/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Pavlo Molchanov</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
32
- <a href="https://hongxu-yin.github.io/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Hongxu Yin</a><sup style="font-size: 0.6em;">2</sup>
33
- <br>
34
- </a><sup style="font-size: 0.6em;">1</sup> UC Berkeley&nbsp;&nbsp;&nbsp;
35
- </a><sup style="font-size: 0.6em;">2</sup> NVIDIA&nbsp;&nbsp;&nbsp;
36
- </div>
37
-
38
- </div>
39
-
40
- <hr style="border: 2px solid gray;"></hr>
41
 
42
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
  ## Pre-Trained Models
44
 
45
  ### PS3 models
@@ -49,7 +91,31 @@ pipeline_tag: image-feature-extraction
49
  | PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) |
50
  | PS3-4K-SigLIP | 3780 * 3780 | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) |
51
 
52
- <hr style="border: 2px solid gray;"></hr>
53
 
54
  ## Performance
55
 
@@ -67,6 +133,10 @@ See Table 1 in the paper for full results.
67
  | SigLIP + S2 | | 3780 | 18225 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
68
  | **PS3-4K-SigLIP** | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) | 3780 | 3840 | 69.8 | 70.9 | 79.1 | 40.5 | 543 | 67.8 | 64.7 | 63.9 |
69

70
 
71
  ## Installation
72
 
@@ -75,16 +145,13 @@ Install through pip to use PS3 out of the box.
75
  pip install ps3-torch
76
  ```
77
 
78
- If you would like to make changes to the PS3 code, clone this repo and install in editable mode.
79
  ```bash
80
  cd PS3
81
  pip install -e .
82
  ```
83
 
84
- <hr style="border: 2px solid gray;"></hr>
85
-
86
-
87
- ## Quick Start
88
 
89
  Here we show example usage including
90
  - loading the model
@@ -111,7 +178,7 @@ x = processor(image)["pixel_values"][0].unsqueeze(0).cuda()
111
 
112
  ### 2. Encode High-Res Image with Bottom-Up Selection
113
 
114
- PS3 can select important high-res patches baed on visual saliency and encode those patches.
115
 
116
  **You can encode the whole high-res image using PS3.**
117
  ```python
@@ -132,9 +199,9 @@ outs = vision_model(x, num_look_close=2)
132
  features = outs.last_hidden_state
133
  print(features.shape) # (1, 5849, 1152)
134
  ```
135
- In this example, it only runs the high-res selection and encoding for twice.
136
 
137
- Note that PS3 processes at most 2560 high-res patches at a time. Then running high-res selection and encoding for twice gives us 2560 * 2 = 5120 high-res tokens. There is also 729 low-res tokens. That gives us 729 + 5120 = 5849 tokens in total.
138
 
139
  **You can also decide how many high-res tokens to process by setting `num_token_look_close`.**
140
  ```python
@@ -142,7 +209,7 @@ outs = vision_model(x, num_token_look_close=3000)
142
  features = outs.last_hidden_state
143
  print(features.shape) # (1, 3729, 1152)
144
  ```
145
- In this example, it only processes 3000 high-res tokens. Note that PS3 only processes 2560 high-res patches at a time. This means it needs to run the high-res selection and encoding for twice, with the first time processing 2560 high-res tokens and the second time processing 440 tokens. In the end it outputs 3729 tokens (3000 high-res + 729 low-res).
146
 
147
  **Visualize the bottom-up patch selection probabilities.**
148
  ```python
@@ -255,9 +322,7 @@ This will create a masked feature map `feature_maps` which is a list of feature
255
 
256
 
257
 
258
- <hr style="border: 2px solid gray;"></hr>
259
-
260
- ## Inference
261
 
262
  [Quick Start](#quick-start) gives some examples of how to use PS3 to encode an image. Below are more detailed explanations of the arguments of model inference.
263
 
@@ -282,28 +347,29 @@ class PS3VisionModel(PS3PreTrainedModel):
282
 
283
  `num_look_close`: how many times to run high-res selection and encoding. PS3 selects and processes 2560 patches each time. If set to `all` then it selects all the high-res patches. If set to `0` then PS3 only returns the low-res features. If set to a larger number than what it needs to encode all the high-res patches, then PS3 will clamp it to the max number needed.
284
 
285
- `num_token_look_close`: (optinoal) how many high-res patches to select and process. Similar to `num_look_close` but counts the number of high-res tokens instead of number of running high-res encoding.
286
 
287
- `prompt`: (optional) the prompt embedding used to select high-res patches. The prompt embedding can be embedding of some text, or some embedding output by an LLM (see paper). The shape of prompt embedding is (B, C) where B is the batch size (same in `pixel_values`) and C is the embedding dimension (same as PS3 token embedding dimension). If `prompt=None`, then PS3 will select high-res patches based on visual saliency (bottom-up selection).
288
 
289
- `gt_selection_maps`: (optional) the ground truth selection maps for the image. It should be a tensor of 0/1 values with shape (B, h, w). Regions with value 1 means they should be selected. When selectin high-res patches, PS3 will interpolate the `gt_selection_maps` to the same size as the feature map at each scale, prioritize selecting the tokens where the value is 1, and if there's still budget for selecting more tokens, select the rest based on the original selection probability.
290
 
291
- `smooth_selection_prob`: (optional) smooth the selectino probability map such that the selected patches won't be distributed too scarcely each time it runs high-res selection. It slightly improves the performance occasinoally when selecting all the patches but usually hurts when selecting parts of the patches.
292
 
293
- `only_select_first_n_scale`: (optional) only select the first n high-res scales. For example, for PS3-4K model, if `only_select_first_n_scale=2`, then only select and process scales of 756 and 1512, and ignore the scale of 3780.
294
 
295
  `is_global_text`: (optional) only return the pooled low-res features. *It will only be used during pre-training.*
296
 
297
  `pool_gt_token_only`: (optional) only pool the tokens inside the gt selection regions. *It will only be used during pre-training.*
298
 
299
 
300
-
301
- <hr style="border: 2px solid gray;"></hr>
302
 
303
 
304
  ## More Details
305
  Please refer to the [PS3 codebase](https://github.com/NVlabs/PS3) for more details.
306
 
 
307
  ## Citation
308
 
309
  If you find this work useful in your research, please consider citing:
@@ -315,4 +381,6 @@ If you find this work useful in your research, please consider citing:
315
  journal={arXiv preprint arXiv:2503.19903},
316
  year={2025}
317
  }
318
- ```
 
 
 
5
  - google/siglip-so400m-patch14-384
6
  pipeline_tag: image-feature-extraction
7
  ---
 
8
 
9
+ ## Description: <br>
10
+
11
+ PS3-1.5K-SigLIP is a vision encoder that extracts visual features from images of up to 1.5K resolution.
12
+
13
+ This model is for research and development only.
14
+
15
+ ### License/Terms of Use: <br>
16
+
17
+ NVIDIA license (see https://huggingface.co/nvidia/PS3-1.5K-SigLIP/blob/main/LICENSE.md)
18
+
19
+ ### Deployment Geography:
20
+
21
+ Global
22
+
23
+ ### Use Case: <br>
24
+
25
+ The model is used for extracting visual features from high-resolution images.
26
+
27
+ ### Release Date: <br>
28
+
29
+ Huggingface [05/30/2025] via [https://huggingface.co/nvidia/PS3-1.5K-SigLIP] <br>
30
+
31
+ ## Reference(s):
32
+
33
+ The model is from the paper [Scaling Vision Pre-Training to 4K Resolution](https://arxiv.org/abs/2503.19903). Useful links:
34
 
35
  [![website](https://img.shields.io/badge/website-76b900?style=for-the-badge&logo=safari&labelColor=555555)](https://nvlabs.github.io/PS3/)
36
  [![Arxiv](https://img.shields.io/badge/Arxiv-b31b1b?style=for-the-badge&logo=arxiv&labelColor=555555)](https://arxiv.org/abs/2503.19903)
 
38
  [![PS3 Models](https://img.shields.io/badge/PS3%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
39
  [![VILA-HD Models](https://img.shields.io/badge/VILA--HD%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
40
  [![PS3 Code](https://img.shields.io/badge/PS3%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/PS3)
41
 
42
 
43
+ ## Model Architecture:
44
+ **Architecture Type:** Neural Network
45
+
46
+ **Network Architecture:** Vision Transformer designed for high-resolution images
47
+
48
+ This model was developed based on [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384). Please see the paper for the training design.
49
+
50
+
51
+ ## Input: <br>
52
+ **Input Type(s):** Image <br>
53
+ **Input Format:** Red, Green, Blue (RGB) <br>
54
+ **Input Parameters:** 2D <br>
55
+ **Other Properties Related to Input:** Image resolutions up to 1512*1512. <br>
56
+
57
+ ## Output: <br>
58
+ **Output Type(s):** Embeddings <br>
59
+ **Output Format:** Tensor <br>
60
+ **Output Parameters:** 1D <br>
61
+ **Other Properties Related to Output:** Downstream model required to leverage image features <br>
62
+
63
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
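
As a rough sketch of this input/output contract, the snippet below assumes the `ps3-torch` package is installed and that the model and preprocessor can be loaded with `from_pretrained`-style helpers (an assumption; this README only documents the `processor` and `vision_model` calls shown in the Quick Start below). The point here is only the shapes: an RGB image goes in, a tensor of token embeddings comes out.

```python
# Sketch only: illustrates the image-in / embeddings-out contract.
import torch
from PIL import Image

# Assumed loading helpers; see the Quick Start below for the documented usage
# of `processor` and `vision_model` once they are constructed.
from ps3 import PS3VisionModel, PS3ImageProcessor

vision_model = PS3VisionModel.from_pretrained("nvidia/PS3-1.5K-SigLIP").cuda().eval()
processor = PS3ImageProcessor.from_pretrained("nvidia/PS3-1.5K-SigLIP")

image = Image.open("example.jpg").convert("RGB")              # up to 1512 * 1512 for this model
x = processor(image)["pixel_values"][0].unsqueeze(0).cuda()   # (1, 3, H, W)

with torch.no_grad():
    outs = vision_model(x, num_look_close=1)                  # one round of high-res selection

features = outs.last_hidden_state                             # (1, num_tokens, embedding_dim)
print(features.shape)
```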
64
+
65
+ ## Software Integration:
66
+ **Runtime Engine(s):**
67
+ N/A <br>
68
+
69
+ **Supported Hardware Microarchitecture Compatibility:** <br>
70
+ NVIDIA Ampere <br>
71
+ NVIDIA Blackwell <br>
72
+ NVIDIA Jetson <br>
73
+ NVIDIA Hopper <br>
74
+
75
+ **Preferred/Supported Operating System(s):**
76
+ Linux <br>
77
+ Linux 4 Tegra <br>
78
+ QNX <br>
79
+ Windows <br>
80
+
81
+ ## Model Version(s):
82
+
83
+ v1.0 - Initial release
84
+
85
  ## Pre-Trained Models
86
 
87
  ### PS3 models
 
91
  | PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) |
92
  | PS3-4K-SigLIP | 3780 * 3780 | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) |
93
 
94
+ ## Training Datasets: <br>
95
+
96
+ 75M images <br>
97
+
98
+ One dataset built from:
99
+ - SA-1B (https://ai.meta.com/datasets/segment-anything/)
100
+ - IDL (https://huggingface.co/datasets/pixparse/idl-wds)
101
+
102
+ Training: 100% <br>
103
+
104
+ ## Training Dataset:
105
+
106
+ **Link:**
107
+ We used the following datasets while developing PS3:
108
+ - SA-1B (https://ai.meta.com/datasets/segment-anything/)
109
+ - IDL (https://huggingface.co/datasets/pixparse/idl-wds)
110
+
111
+ **Data Collection Method by dataset:** <br>
112
+ Automated
113
+
114
+ **Labeling Method by dataset:** <br>
115
+ Automated
116
+
117
+ **Properties (Quantity, Dataset Descriptions, Sensor(s)):** <br>
118
+ 75M images with resolution up to 4Kx4K.
119
 
120
  ## Performance
121
 
 
133
  | SigLIP + S2 | | 3780 | 18225 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
134
  | **PS3-4K-SigLIP** | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) | 3780 | 3840 | 69.8 | 70.9 | 79.1 | 40.5 | 543 | 67.8 | 64.7 | 63.9 |
135
 
136
+ ## Inference:
137
+ **Acceleration Engine:** N/A <br>
138
+ **Test Hardware:** <br>
139
+ The model is tested on NVIDIA A100 GPU.
140
 
141
  ## Installation
142
 
 
145
  pip install ps3-torch
146
  ```
147
 
148
+ If you would like to make changes to the PS3 code, go to [PS3 repository](https://github.com/NVlabs/PS3), clone the repo, and install in editable mode.
149
  ```bash
150
  cd PS3
151
  pip install -e .
152
  ```
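
If you want to confirm which copy of the package is active after an editable install, `pip show` reports the installed version and location; this is a generic pip tip rather than anything PS3-specific.

```bash
# Check that the editable install points at your local clone.
pip show ps3-torch
```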
153
 
154
+ ## Inference - Quick Start
 
 
 
155
 
156
  Here we show example usage including
157
  - loading the model
 
178
 
179
  ### 2. Encode High-Res Image with Bottom-Up Selection
180
 
181
+ PS3 can select important high-res patches based on visual saliency and encode those patches.
182
 
183
  **You can encode the whole high-res image using PS3.**
184
  ```python
 
199
  features = outs.last_hidden_state
200
  print(features.shape) # (1, 5849, 1152)
201
  ```
202
+ In this example, it only runs the high-res selection and encoding twice.
203
 
204
+ Note that PS3 processes at most 2560 high-res patches at a time, so running high-res selection and encoding twice gives us 2560 * 2 = 5120 high-res tokens. There are also 729 low-res tokens, which gives us 729 + 5120 = 5849 tokens in total.
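
The bookkeeping above can be restated as a one-liner; the constants below (729 low-res tokens, at most 2560 high-res patches per selection round) are the figures quoted in this README, and the helper is purely illustrative.

```python
# Illustration only: reproduce the token count quoted above.
LOW_RES_TOKENS = 729        # low-res tokens
PATCHES_PER_ROUND = 2560    # high-res patches selected and encoded per round

def total_tokens(num_look_close: int) -> int:
    """Total output tokens after `num_look_close` rounds of high-res selection."""
    return LOW_RES_TOKENS + PATCHES_PER_ROUND * num_look_close

print(total_tokens(2))      # 729 + 2560 * 2 = 5849, matching the shape above
```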
205
 
206
  **You can also decide how many high-res tokens to process by setting `num_token_look_close`.**
207
  ```python
 
209
  features = outs.last_hidden_state
210
  print(features.shape) # (1, 3729, 1152)
211
  ```
212
+ In this example, it only processes 3000 high-res tokens. Note that PS3 only processes 2560 high-res patches at a time. This means it needs to run the high-res selection and encoding twice, with the first time processing 2560 high-res tokens and the second time processing 440 tokens. In the end it outputs 3729 tokens (3000 high-res + 729 low-res).
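
The same arithmetic applies here, except that `num_token_look_close` fixes the number of high-res tokens directly and the number of selection rounds follows from the 2560-patch-per-round limit; the snippet below just restates that bookkeeping.

```python
import math

LOW_RES_TOKENS = 729
PATCHES_PER_ROUND = 2560

num_token_look_close = 3000
rounds = math.ceil(num_token_look_close / PATCHES_PER_ROUND)    # 2 rounds (2560 + 440)
total = LOW_RES_TOKENS + num_token_look_close                   # 3729 tokens
print(rounds, total)
```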
213
 
214
  **Visualize the bottom-up patch selection probabilities.**
215
  ```python
 
322
 
323
 
324
 
325
+ ## Inference Instructions
 
 
326
 
327
  [Quick Start](#quick-start) gives some examples of how to use PS3 to encode an image. Below are more detailed explanations of the arguments of model inference.
328
 
 
347
 
348
  `num_look_close`: how many times to run high-res selection and encoding. PS3 selects and processes 2560 patches each time. If set to `all` then it selects all the high-res patches. If set to `0` then PS3 only returns the low-res features. If set to a larger number than what it needs to encode all the high-res patches, then PS3 will clamp it to the max number needed.
349
 
350
+ `num_token_look_close`: (optional) how many high-res patches to select and process. Similar to `num_look_close`, but `num_token_look_close` directly specifies the number of high-res tokens instead of the number of times high-res encoding is run.
351
 
352
+ `prompt`: (optional) the prompt embedding used to select high-res patches. The prompt embedding can be the embedding of some text, or an embedding output by an LLM (see the paper). The shape of the prompt embedding is (B, C), where B is the batch size (same as in `pixel_values`) and C is the embedding dimension (same as the PS3 token embedding dimension). If `prompt=None`, then PS3 will select high-res patches based on visual saliency (bottom-up selection).
353
 
354
+ `gt_selection_maps`: (optional) the ground truth selection maps for the image. It should be a tensor of 0/1 values with shape (B, h, w). Regions with value 1 mean they should be selected. When selecting high-res patches, PS3 will interpolate the `gt_selection_maps` to the same size as the feature map at each scale, prioritize selecting the tokens where the value is 1, and if there's still budget for selecting more tokens, it will select the rest based on the original selection probability.
355
 
356
+ `smooth_selection_prob`: (optional) smooth the selection probability map so that the selected patches are not distributed too sparsely each time high-res selection runs. It occasionally improves performance slightly when selecting all the patches but usually hurts when selecting only part of them.
357
 
358
+ `only_select_first_n_scale`: (optional) only select the first n high-res scales. For example, for PS3-4K model, if `only_select_first_n_scale=2`, then it only selects and processes scales of 756 and 1512, and ignores the scale of 3780.
359
 
360
  `is_global_text`: (optional) only return the pooled low-res features. *It will only be used during pre-training.*
361
 
362
  `pool_gt_token_only`: (optional) only pool the tokens inside the gt selection regions. *It will only be used during pre-training.*
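
As a compact illustration of how these arguments combine, the sketch below assumes `vision_model` and the preprocessed image `x` from the Quick Start above; the prompt embedding and selection map are random placeholder tensors used only to show the expected shapes, and 1152 is the token embedding dimension that appears in the shapes printed earlier.

```python
import torch

B, C = x.shape[0], 1152   # C matches the embedding dimension in the shapes above

# Bottom-up selection, restricted to the lower high-res scales
# (e.g. for the 4K model this skips the 3780 scale, as described above).
outs = vision_model(x, num_look_close=2, only_select_first_n_scale=2)

# Top-down selection guided by a prompt embedding of shape (B, C);
# a random tensor stands in for a text or LLM embedding here.
prompt_embedding = torch.randn(B, C, device=x.device, dtype=x.dtype)
outs = vision_model(x, num_token_look_close=3000, prompt=prompt_embedding)

# Prioritize a ground-truth region with a binary (B, h, w) map;
# here a coarse 64 x 64 placeholder marks the top-left quadrant.
gt_map = torch.zeros(B, 64, 64, device=x.device)
gt_map[:, :32, :32] = 1
outs = vision_model(x, num_look_close=2, gt_selection_maps=gt_map)

features = outs.last_hidden_state   # (B, num_tokens, 1152)
print(features.shape)
```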
363
 
364
 
365
+ ### Ethical Considerations:
366
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
367
 
368
 
369
  ## More Details
370
  Please refer to the [PS3 codebase](https://github.com/NVlabs/PS3) for more details.
371
 
372
+
373
  ## Citation
374
 
375
  If you find this work useful in your research, please consider citing:
 
381
  journal={arXiv preprint arXiv:2503.19903},
382
  year={2025}
383
  }
384
+ ```
385
+
386
+