MaxSup: Overcoming Representation Collapse in Label Smoothing
Max Suppression (MaxSup) is a novel regularization technique that overcomes the shortcomings of traditional Label Smoothing (LS). While LS prevents overconfidence by softening one-hot labels, it inadvertently collapses intra-class feature diversity and can amplify errors by making incorrect predictions more confident. In contrast, MaxSup applies a uniform smoothing penalty to the model's top prediction, regardless of correctness, preserving richer per-sample information and improving both classification performance and downstream transfer.
Table of Contents
- Overview
- Methodology: MaxSup vs. Label Smoothing
- Enhanced Feature Representation
- Training Vision Transformers with MaxSup
- Pretrained Weights
- Training ConvNets with MaxSup
- Logit Characteristic Visualization
- Citation
- References
Overview
Traditional Label Smoothing (LS) replaces one-hot labels with a smoothed version to reduce overconfidence. However, LS can over-tighten feature clusters within each class and may reinforce errors by making mispredictions overconfident. MaxSup tackles these issues by applying a smoothing penalty to the model's top-1 logit output regardless of whether the prediction is correct, thus preserving intra-class diversity and enhancing inter-class separation. The result is improved performance on both classification tasks and downstream applications such as linear transfer and image segmentation.
Methodology: MaxSup vs. Label Smoothing
Label Smoothing softens the target distribution by blending the one-hot vector with a uniform distribution. Although effective at reducing overconfidence, LS inadvertently introduces two effects:
- A regularization term that limits the sharpness of predictions.
- An error-enhancement term that can cause overconfident wrong predictions.
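To make these two effects concrete, the term that LS adds on top of standard cross-entropy can be decomposed as follows (a sketch consistent with the MaxSup formula below; $z_{\text{gt}}$ denotes the logit of the ground-truth class):

$$
\alpha\left(z_{\text{gt}} - \frac{1}{K}\sum_{k=1}^{K} z_k\right)
= \underbrace{\alpha\left(z_{\max} - \frac{1}{K}\sum_{k=1}^{K} z_k\right)}_{\text{regularization}}
\;-\; \underbrace{\alpha\left(z_{\max} - z_{\text{gt}}\right)}_{\text{error enhancement}}.
$$

For a correct prediction, $z_{\text{gt}} = z_{\max}$ and the second term vanishes; for a misprediction, minimizing the loss rewards widening the gap $z_{\max} - z_{\text{gt}}$, i.e., making the wrong top prediction even more confident.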
MaxSup addresses this by uniformly penalizing the highest logit output, whether it corresponds to the true class or not. This approach enforces a consistent regularization effect across all samples. In formula form:
$$
L_{\text{MaxSup}} = \alpha \left( z_{\max} - \frac{1}{K}\sum_{k=1}^{K} z_k \right),
$$

where $z_{\max}$ is the highest logit among the $K$ classes and $\alpha$ controls the regularization strength. This mechanism prevents the prediction distribution from becoming too peaky while preserving informative signals from non-target classes.
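As an illustration, here is a minimal PyTorch sketch of a combined objective (cross-entropy plus the MaxSup penalty), assuming logits of shape `(batch, K)`; the function name `maxsup_loss` is ours, not necessarily the repository's API:

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus the MaxSup penalty alpha * (z_max - mean_k z_k)."""
    ce = F.cross_entropy(logits, targets)
    z_max = logits.max(dim=1).values   # top-1 logit, whether or not it is the true class
    z_mean = logits.mean(dim=1)        # uniform mean over the K class logits
    return ce + alpha * (z_max - z_mean).mean()
```

Note that, unlike LS, the penalty depends only on the logits, not on the label, which is what makes the regularization strength uniform across correctly and incorrectly classified samples.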
Enhanced Feature Representation
Qualitative Evaluation
MaxSup-trained models display richer intra-class feature diversity compared to models trained with traditional LS. Feature embedding visualizations show that while LS forces features into tight clusters, MaxSup preserves finer-grained differences among samples. Grad-CAM analyses also demonstrate that MaxSup-trained models focus more precisely on relevant class-discriminative regions.
Figure 1: Feature representations. MaxSup maintains greater intra-class diversity and clear inter-class boundaries.
Figure 2: Grad-CAM visualizations. The MaxSup model (row 2) accurately highlights target objects, whereas the LS model (row 3) and Baseline (row 4) show more diffuse activations.
Quantitative Evaluation
We evaluated feature representations on ResNet-50 trained on ImageNet-1K. Intra-class variation (reflecting the diversity within classes) and inter-class separability (indicating class distinctiveness) were measured. Additionally, a linear transfer learning task on CIFAR-10 was performed.
Table 1: Feature Representation Metrics (ResNet-50 on ImageNet-1K)
| Method | Intra-class Var. (Train) | Intra-class Var. (Val) | Inter-class Sep. (Train) | Inter-class Sep. (Val) |
|---|---|---|---|---|
| Baseline | 0.3114 | 0.3313 | 0.4025 | 0.4451 |
| Label Smoothing | 0.2632 | 0.2543 | 0.4690 | 0.4611 |
| Online LS | 0.2707 | 0.2820 | 0.5943 | 0.5708 |
| Zipf’s LS | 0.2611 | 0.2932 | 0.5522 | 0.4790 |
| MaxSup (ours) | 0.2926 | 0.2998 | 0.5188 | 0.4972 |
Higher intra-class variation indicates more preserved sample-specific details, while higher inter-class separability suggests better class discrimination.
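For reference, one common way to compute these two metrics from a matrix of features is sketched below (the paper's exact definitions may differ, e.g. in the choice of normalization or distance):

```python
import numpy as np

def intra_class_variation(features: np.ndarray, labels: np.ndarray) -> float:
    """Mean distance of each sample to its class centroid, on L2-normalized features."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    per_class = []
    for c in np.unique(labels):
        cls = feats[labels == c]
        centroid = cls.mean(axis=0)
        per_class.append(np.linalg.norm(cls - centroid, axis=1).mean())
    return float(np.mean(per_class))

def inter_class_separability(features: np.ndarray, labels: np.ndarray) -> float:
    """Mean pairwise distance between class centroids, on L2-normalized features."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    cents = np.stack([feats[labels == c].mean(axis=0) for c in np.unique(labels)])
    n = len(cents)
    dists = [np.linalg.norm(cents[i] - cents[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```

Under these definitions, perfectly collapsed classes give an intra-class variation of zero, matching the intuition that LS over-tightens clusters.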
Table 2: Linear Transfer Accuracy on CIFAR-10
| Pretraining Method | Accuracy (%) |
|---|---|
| Baseline | 81.43 |
| Label Smoothing | 74.58 |
| MaxSup | 81.02 |
Label Smoothing degrades transfer accuracy due to its over-smoothing effect, whereas MaxSup nearly matches the baseline performance while still offering improved calibration.
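The linear-transfer protocol (freeze the backbone, fit a linear classifier on the extracted features) can be sketched as follows; one-vs-all ridge regression is used here as a cheap stand-in for the logistic-regression probe typically used in such evaluations:

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels, l2=1e-3):
    """Fit a linear classifier on frozen features via closed-form ridge regression."""
    num_classes = int(train_labels.max()) + 1
    Y = np.eye(num_classes)[train_labels]  # one-hot targets
    d = train_feats.shape[1]
    # W = (X^T X + l2 I)^{-1} X^T Y
    W = np.linalg.solve(train_feats.T @ train_feats + l2 * np.eye(d), train_feats.T @ Y)
    preds = (test_feats @ W).argmax(axis=1)
    return float((preds == test_labels).mean())
```

Because the backbone is frozen, this measures how much class-discriminative and transferable information the pretrained features retain, which is exactly where over-smoothed LS features fall short.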
Training Vision Transformers with MaxSup
We integrated MaxSup into the training pipeline for Vision Transformers using the DeiT framework.
To train a ViT with MaxSup:

```bash
cd Deit
bash train_with_MaxSup.sh
```
This script trains a DeiT-Small model on ImageNet-1K with MaxSup regularization.
Accelerated Data Loading via Caching (Optional)
For improved data loading efficiency on systems with slow I/O, a caching mechanism is provided. This feature compresses the ImageNet dataset into ZIP files and loads them into memory. Enable caching by adding the `--cache` flag to the training script.
Preparing Data and Annotations for Caching

1. Create ZIP archives. In your ImageNet data directory, run:

   ```bash
   cd data/ImageNet
   zip -r train.zip train
   zip -r val.zip val
   ```

2. Add mapping files. Download `train_map.txt` and `val_map.txt` from our release assets and place them in the `data/ImageNet` directory. The directory should appear as follows:

   ```
   data/ImageNet/
   ├── train_map.txt   # Relative paths and labels for training images
   ├── val_map.txt     # Relative paths and labels for validation images
   ├── train.zip       # Compressed training images
   └── val.zip         # Compressed validation images
   ```

   - `train_map.txt`: each line has the format `<class_folder>/<image_filename>\t<label>`.
   - `val_map.txt`: each line has the format `<image_filename>\t<label>`.
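A minimal parser for this map-file format might look like the following (illustrative only; `load_map_file` is not part of the repository):

```python
def load_map_file(path):
    """Parse a map file whose lines are '<relative_path>\t<label>' into (path, label) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rel_path, label = line.rsplit("\t", 1)
            samples.append((rel_path, int(label)))
    return samples
```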
Pretrained Weights
- ConvNet (ResNet-50): Pretrained weights can be downloaded from this page.
These checkpoints can be used for direct evaluation or fine-tuning on downstream tasks.
Training ConvNets with MaxSup
The `Conv/` directory provides scripts for training convolutional networks with MaxSup:

- `Conv/ffcv`: scripts to reproduce ImageNet results using FFCV for efficient data loading. See `Conv/ffcv/README.md` for details.
- `Conv/common_resnet`: additional experiments with ResNet architectures. Refer to `Conv/common_resnet/README.md` for further instructions.
Logit Characteristic Visualization
The `viz/` directory contains a toolkit to analyze the distribution of logits produced by models trained with LS versus MaxSup.
Step 1: Extract Logits
Run the following command to extract logits from your trained model:
```bash
python viz/logits.py \
    --checkpoint /path/to/model_checkpoint.pth \
    --output /path/to/save/logits_labels.pt
```

- `--checkpoint`: path to your model checkpoint.
- `--output`: destination file for the extracted logits and labels.
Step 2: Analyze Logits
After extraction, run:
```bash
python viz/analysis.py --input /path/to/save/logits_labels.pt --output /path/to/analysis_results/
```
This script generates:
- A histogram of near-zero logit proportions.
- A scatter plot comparing top-1 probabilities with near-zero proportions.
- Saved visualizations for side-by-side comparisons.
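The per-sample near-zero statistic underlying the histogram can be computed as sketched below (in the real pipeline the logit matrix would come from the saved `logits_labels.pt`, loaded with `torch.load`; the threshold `eps` is an assumption for illustration):

```python
import numpy as np

def near_zero_proportion(logits: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Per-sample fraction of logits whose magnitude falls below eps, shape (batch,)."""
    return (np.abs(logits) < eps).mean(axis=1)
```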
Figure 3: Logit distribution comparing LS and MaxSup.
For more details, see our paper on arXiv:2502.15798.
References
- DeiT (Vision Transformer): Touvron et al., "Training Data-Efficient Image Transformers & Distillation through Attention," ICML 2021. GitHub.
- Grad-CAM: Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization," ICCV 2017.
- Online Label Smoothing: see the paper for details.
- Zipf's Label Smoothing: see the paper for details.
This repository provides the official implementation of MaxSup. Contributions and discussions are welcome. For any questions or issues, please open an issue on GitHub or contact the authors directly.