Image Preprocessing Service
This service automatically processes various image formats during upload to ensure compatibility and optimal storage.
Overview
The ImagePreprocessor service automatically detects and converts various image formats to PNG or JPEG before storing them in the system. This ensures that all images are in a standard, web-compatible format.
Supported Input Formats
Direct Storage (No Preprocessing)
- PNG (image/png) - Already optimal format
- JPEG (image/jpeg, image/jpg) - Already optimal format
Formats Requiring Preprocessing
HEIC/HEIF Files
- Input: HEIC/HEIF files from modern smartphones
- Processing: Convert to RGB and flatten alpha channel
- Output: PNG or JPEG
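To make "flatten alpha channel" concrete, here is the per-pixel arithmetic behind compositing an RGBA pixel onto an opaque background. This is a minimal pure-Python sketch, not the service's code: the function name `flatten_pixel` is hypothetical, and the white background is an assumption, since this document does not say which background color is used.

```python
def flatten_pixel(rgba, background=(255, 255, 255)):
    """Composite one 8-bit RGBA pixel onto an opaque background (assumed white).

    Each output channel is color * alpha + background * (1 - alpha),
    computed in integer arithmetic with rounding to the nearest value.
    """
    r, g, b, a = rgba
    return tuple(
        (channel * a + bg * (255 - a) + 127) // 255
        for channel, bg in zip((r, g, b), background)
    )


# Fully opaque pixels keep their color; fully transparent ones become background.
print(flatten_pixel((255, 0, 0, 255)))
print(flatten_pixel((255, 0, 0, 0)))
```

In practice an imaging library applies this to the whole image at once; the sketch only shows what happens to each pixel.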
WebP Files
- Input: WebP format (Google's web image format)
- Processing: Convert to RGB and flatten alpha channel
- Output: PNG or JPEG
GIF Files
- Input: GIF files (static or animated)
- Processing: Extract first frame for animated GIFs, convert to RGB
- Output: PNG or JPEG
TIFF/GeoTIFF Files
- Input: TIFF or GeoTIFF files
- Processing: Render RGB view, handle various color spaces
- Output: PNG or JPEG
PDF Files
- Input: PDF documents
- Processing: Rasterize first page at 2x zoom for quality
- Output: PNG or JPEG
- Performance Note: PDF processing is inherently slower due to complex format parsing and rasterization
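As a rough illustration of what "2x zoom" means for output size: PDF page dimensions are measured in points (1/72 inch), so a zoom factor of 2 corresponds to about 144 DPI and four times as many pixels as zoom 1. The helper below is a hypothetical sketch for estimating output dimensions; the rasterizer's exact rounding may differ.

```python
import math

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def rasterized_size(width_pt: float, height_pt: float, zoom: float = 2.0):
    """Estimate (width_px, height_px, effective_dpi) for a page rendered at `zoom`."""
    width_px = math.ceil(width_pt * zoom)
    height_px = math.ceil(height_pt * zoom)
    return width_px, height_px, POINTS_PER_INCH * zoom


# US Letter is 612 x 792 points (8.5 x 11 inches).
w1, h1, _ = rasterized_size(612, 792, zoom=1.0)
w2, h2, dpi = rasterized_size(612, 792, zoom=2.0)
print(w2, h2, dpi)            # pixel dimensions and DPI at 2x
print((w2 * h2) / (w1 * h1))  # pixel count grows with the square of the zoom
```

Because the pixel count grows quadratically with the zoom factor, lowering the zoom is usually the most direct way to trade quality for speed and memory.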
How It Works
1. MIME Type Detection
The service first detects the file format using:
- File extension analysis
- File signature (magic bytes) detection
- Fallback to generic binary if unknown
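The detection steps above can be sketched with the standard library alone. This is an illustrative sketch, not the service's implementation: the function name `detect_mime_type` is hypothetical, only a few signatures are listed, and checking magic bytes before the file extension is a choice made for this example.

```python
import mimetypes

# Leading-byte signatures for some of the formats named in this document.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def detect_mime_type(content: bytes, filename: str) -> str:
    """Guess a MIME type from magic bytes, then extension, then a generic fallback."""
    for signature, mime in _SIGNATURES:
        if content.startswith(signature):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if content[:4] == b"RIFF" and content[8:12] == b"WEBP":
        return "image/webp"
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed:
        return guessed
    return "application/octet-stream"  # generic binary fallback


print(detect_mime_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, "upload.bin"))
```

Trusting the signature over the extension helps when a file has been renamed, at the cost of maintaining a signature table.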
2. Preprocessing Decision
- If format is already PNG/JPEG → No processing needed
- If format requires conversion → Apply appropriate processor
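The decision step amounts to a set lookup followed by a dispatch. The sketch below assumes a mapping from MIME type to a processor label; the labels and the `choose_processor` helper are hypothetical, and the "generic" fallback mirrors the behavior described for unsupported formats elsewhere in this document.

```python
from typing import Optional

# MIME types stored as-is, per the "Direct Storage" list.
DIRECT_STORAGE_TYPES = frozenset({"image/png", "image/jpeg", "image/jpg"})

# Hypothetical processor labels keyed by MIME type; real names may differ.
PROCESSORS = {
    "image/heic": "heic",
    "image/heif": "heic",
    "image/webp": "webp",
    "image/gif": "gif",
    "image/tiff": "tiff",
    "application/pdf": "pdf",
}


def needs_preprocessing(mime_type: str) -> bool:
    """True for anything that is not already PNG or JPEG."""
    return mime_type.lower() not in DIRECT_STORAGE_TYPES


def choose_processor(mime_type: str) -> Optional[str]:
    """Return a processor label, or None when the file can be stored directly."""
    mime = mime_type.lower()
    if mime in DIRECT_STORAGE_TYPES:
        return None
    return PROCESSORS.get(mime, "generic")


print(choose_processor("image/webp"), choose_processor("image/png"))
```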
3. Format Conversion
Each format has a specialized processor that:
- Opens the file using appropriate library (PIL, PyMuPDF)
- Converts to RGB color space
- Flattens alpha channels
- Optimizes output quality
- Generates new filename with correct extension
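The last step, generating a new filename with the correct extension, might look like the following. This is a sketch under assumptions: the helper name `converted_filename` and the choice of `.jpg` over `.jpeg` are illustrative, and any directory components in the original name are dropped.

```python
from pathlib import PurePath

# Extension per target format; ".jpg" is an assumed convention.
_EXTENSIONS = {"PNG": ".png", "JPEG": ".jpg"}


def converted_filename(original: str, target_format: str = "PNG") -> str:
    """Swap the original extension for the one matching the target format."""
    extension = _EXTENSIONS[target_format.upper()]
    return PurePath(original).stem + extension


print(converted_filename("original.heic"))
print(converted_filename("report.pdf", "JPEG"))
```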
4. Storage
- Processed image is stored with new filename
- Original filename is preserved in metadata
- SHA256 hash is calculated from processed content
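A minimal sketch of the storage step, assuming a plain dictionary as the metadata record. The field names and the `build_storage_record` helper are hypothetical; the point is that the hash is computed over the processed bytes, not the original upload.

```python
import hashlib


def build_storage_record(processed: bytes, new_filename: str, original_filename: str) -> dict:
    """Describe a processed image: new name, preserved original name, content hash."""
    return {
        "filename": new_filename,
        "original_filename": original_filename,  # kept in metadata
        "sha256": hashlib.sha256(processed).hexdigest(),
        "size_bytes": len(processed),
    }


record = build_storage_record(b"abc", "original.png", "original.heic")
print(record["sha256"])
```

Hashing the processed content means two uploads that convert to identical bytes share a hash, even if the original files differed in format.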
Integration Points
Upload Endpoint (/api/images/)
- All file uploads go through preprocessing
- Supports drag & drop and file picker
- Handles both crisis maps and drone imagery
Contribution Endpoint (/api/contribute/from-url)
- Images contributed from existing URLs are also preprocessed
- Ensures consistency across all image sources
Configuration
Target Format
- Default: PNG (better quality, lossless)
- Alternative: JPEG (smaller file size, lossy)
- Quality: 95 for JPEG (configurable)
Error Handling
- If preprocessing fails, falls back to original content
- Logs errors for debugging
- Continues upload process
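The fallback behavior described above can be sketched as a thin wrapper around a conversion function. The wrapper name `preprocess_with_fallback`, the `convert` callable, and the logger name are assumptions for illustration; the sketch only shows the pattern of logging the failure and returning the original content so the upload can continue.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("image_preprocessor")  # hypothetical logger name

Result = Tuple[bytes, str, str]  # (content, filename, mime_type)


def preprocess_with_fallback(
    content: bytes,
    filename: str,
    mime_type: str,
    convert: Callable[[bytes, str], Result],
) -> Result:
    """Run `convert`; on any failure, log it and return the original upload unchanged."""
    try:
        return convert(content, filename)
    except Exception:
        logger.exception("Preprocessing failed for %s; storing original content", filename)
        return content, filename, mime_type


# A converter that always fails, to show the fallback path.
result = preprocess_with_fallback(b"raw", "scan.tiff", "image/tiff", lambda c, f: 1 / 0)
print(result)
```

One consequence of this design is that a failed conversion leaves a non-PNG/JPEG file in storage, so downstream code should not assume every stored image was converted.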
Dependencies
- Pillow (PIL): Core image processing
- PyMuPDF: PDF rasterization
- Python standard library: MIME type detection, file handling
Benefits
- Format Consistency: All stored images are in web-compatible formats
- Quality Assurance: Automatic optimization and color space conversion
- User Experience: Users can upload any common image format
- Storage Efficiency: Optimized file sizes and formats
- Compatibility: Ensures images work across all platforms and browsers
Example Usage
from app.services.image_preprocessor import ImagePreprocessor

# Process an image
processed_content, new_filename, mime_type = ImagePreprocessor.preprocess_image(
    file_content,
    "original.heic",
    target_format='PNG',
    quality=95
)

# Check if preprocessing is needed
if ImagePreprocessor.needs_preprocessing(mime_type):
    print(f"Converting {mime_type} to PNG...")
Error Handling
The service gracefully handles errors:
- Unsupported formats: Falls back to generic processing
- Corrupted files: Logs error and continues with original
- Processing failures: Maintains upload functionality
- Memory issues: Handles large files efficiently
Performance Considerations
PDF Processing Performance
PDF conversion is the most computationally expensive operation due to:
- Complex Format: PDFs require parsing, interpretation, and rendering
- Rasterization: Vector-to-pixel conversion is CPU-intensive
- Memory Usage: Large PDFs can consume significant memory
- Quality vs Speed: Higher zoom factors increase quality but decrease speed
Performance Tuning Options
from app.services.image_preprocessor import ImagePreprocessor
# Fast mode - lower quality, much faster
ImagePreprocessor.configure_pdf_processing(quality_mode='fast')
# Balanced mode - good quality, reasonable speed (default)
ImagePreprocessor.configure_pdf_processing(quality_mode='balanced')
# Quality mode - highest quality, slower processing
ImagePreprocessor.configure_pdf_processing(quality_mode='quality')
# Custom configuration
ImagePreprocessor.configure_pdf_processing(
    zoom_factor=1.2,    # Lower zoom = faster
    compress_level=4,   # Lower compression = faster
    quality_mode='balanced'
)
Expected Processing Times
- Small PDFs (<1MB): 2-5 seconds
- Medium PDFs (1-5MB): 5-15 seconds
- Large PDFs (5-25MB): 15-60 seconds
- Complex PDFs: May take longer due to graphics complexity
Future Enhancements
- Batch processing: Process multiple images simultaneously
- Format preferences: User-configurable output formats
- Quality settings: Adjustable compression levels
- Metadata preservation: Keep EXIF and other metadata
- Progressive processing: Stream large files