# Image Preprocessing Service This service automatically processes various image formats during upload to ensure compatibility and optimal storage. ## Overview The `ImagePreprocessor` service automatically detects and converts various image formats to PNG or JPEG before storing them in the system. This ensures that all images are in a standard, web-compatible format. ## Supported Input Formats ### Direct Storage (No Preprocessing) - **PNG** (`image/png`) - Already optimal format - **JPEG** (`image/jpeg`, `image/jpg`) - Already optimal format ### Formats Requiring Preprocessing #### HEIC/HEIF Files - **Input**: HEIC/HEIF files from modern smartphones - **Processing**: Convert to RGB and flatten alpha channel - **Output**: PNG or JPEG #### WebP Files - **Input**: WebP format (Google's web image format) - **Processing**: Convert to RGB and flatten alpha channel - **Output**: PNG or JPEG #### GIF Files - **Input**: GIF files (static or animated) - **Processing**: Extract first frame for animated GIFs, convert to RGB - **Output**: PNG or JPEG #### TIFF/GeoTIFF Files - **Input**: TIFF or GeoTIFF files - **Processing**: Render RGB view, handle various color spaces - **Output**: PNG or JPEG #### PDF Files - **Input**: PDF documents - **Processing**: Rasterize first page at 2x zoom for quality - **Output**: PNG or JPEG - **Performance Note**: PDF processing is inherently slower due to complex format parsing and rasterization ## How It Works ### 1. MIME Type Detection The service first detects the file format using: - File extension analysis - File signature (magic bytes) detection - Fallback to generic binary if unknown ### 2. Preprocessing Decision - If format is already PNG/JPEG → No processing needed - If format requires conversion → Apply appropriate processor ### 3. Format Conversion Each format has a specialized processor that: - Opens the file using appropriate library (PIL, PyMuPDF) - Converts to RGB color space - Flattens alpha channels - Optimizes output quality - Generates new filename with correct extension ### 4. Storage - Processed image is stored with new filename - Original filename is preserved in metadata - SHA256 hash is calculated from processed content ## Integration Points ### Upload Endpoint (`/api/images/`) - All file uploads go through preprocessing - Supports drag & drop and file picker - Handles both crisis maps and drone imagery ### Contribution Endpoint (`/api/contribute/from-url`) - Images contributed from existing URLs are also preprocessed - Ensures consistency across all image sources ## Configuration ### Target Format - **Default**: PNG (better quality, lossless) - **Alternative**: JPEG (smaller file size, lossy) - **Quality**: 95% for JPEG (configurable) ### Error Handling - If preprocessing fails, falls back to original content - Logs errors for debugging - Continues upload process ## Dependencies - **Pillow (PIL)**: Core image processing - **PyMuPDF**: PDF rasterization - **Python standard library**: MIME type detection, file handling ## Benefits 1. **Format Consistency**: All stored images are in web-compatible formats 2. **Quality Assurance**: Automatic optimization and color space conversion 3. **User Experience**: Users can upload any common image format 4. **Storage Efficiency**: Optimized file sizes and formats 5. **Compatibility**: Ensures images work across all platforms and browsers ## Example Usage ```python from app.services.image_preprocessor import ImagePreprocessor # Process an image processed_content, new_filename, mime_type = ImagePreprocessor.preprocess_image( file_content, "original.heic", target_format='PNG', quality=95 ) # Check if preprocessing is needed if ImagePreprocessor.needs_preprocessing(mime_type): print(f"Converting {mime_type} to PNG...") ``` ## Error Handling The service gracefully handles errors: - **Unsupported formats**: Falls back to generic processing - **Corrupted files**: Logs error and continues with original - **Processing failures**: Maintains upload functionality - **Memory issues**: Handles large files efficiently ## Performance Considerations ### PDF Processing Performance PDF conversion is the most computationally expensive operation due to: - **Complex Format**: PDFs require parsing, interpretation, and rendering - **Rasterization**: Vector-to-pixel conversion is CPU-intensive - **Memory Usage**: Large PDFs can consume significant memory - **Quality vs Speed**: Higher zoom factors increase quality but decrease speed ### Performance Tuning Options ```python from app.services.image_preprocessor import ImagePreprocessor # Fast mode - lower quality, much faster ImagePreprocessor.configure_pdf_processing(quality_mode='fast') # Balanced mode - good quality, reasonable speed (default) ImagePreprocessor.configure_pdf_processing(quality_mode='balanced') # Quality mode - highest quality, slower processing ImagePreprocessor.configure_pdf_processing(quality_mode='quality') # Custom configuration ImagePreprocessor.configure_pdf_processing( zoom_factor=1.2, # Lower zoom = faster compress_level=4, # Lower compression = faster quality_mode='balanced' ) ``` ### Expected Processing Times - **Small PDFs (<1MB)**: 2-5 seconds - **Medium PDFs (1-5MB)**: 5-15 seconds - **Large PDFs (5-25MB)**: 15-60 seconds - **Complex PDFs**: May take longer due to graphics complexity ## Future Enhancements - **Batch processing**: Process multiple images simultaneously - **Format preferences**: User-configurable output formats - **Quality settings**: Adjustable compression levels - **Metadata preservation**: Keep EXIF and other metadata - **Progressive processing**: Stream large files