Spaces: Kiruthick18/PDF_Summarizer

Update README.md

by harikumar87 - opened 28 days ago

base: refs/heads/main ← from: refs/pr/1

README.md +293 -292

@@ -1,293 +1,294 @@
----
-title: AI PDF Summarizer
-emoji: 📄
-colorFrom: blue
-colorTo: purple
-sdk: gradio
-sdk_version: 5.32.0
-app_file: app.py
-pinned: false
-license: mit
-thumbnail: >-
-  https://cdn-uploads.huggingface.co/production/uploads/6474405f90330355db146c76/uCiC_ILzv0UUhGHSOBVzJ.jpeg
-short_description: An intelligent PDF document summarizer.
----
-# ⚡ Lightning PDF Summarizer
-**Ultra-fast AI-powered PDF summarization** with intelligent text processing and beautiful interface.
-![Python](https://img.shields.io/badge/python-v3.10+-blue.svg)
-![Gradio](https://img.shields.io/badge/gradio-v4.44+-green.svg)
-![Transformers](https://img.shields.io/badge/transformers-v4.30+-orange.svg)
-![License](https://img.shields.io/badge/license-MIT-blue.svg)
-## 🚀 Features
-### ⚡ **Lightning Fast Performance**
-- **Ultra-fast DistilBART model** - 6x smaller than BART-Large (400MB vs 1.6GB)
-- **Optimized processing** - Smart chunking with 5-15 second processing times
-- **GPU acceleration** - Automatic CUDA detection and optimization
-- **Memory efficient** - Processes large PDFs without memory issues
-### 🎯 **Smart Summarization**
-- **3 Summary Modes**: Brief (Quick), Detailed, Comprehensive
-- **Intelligent chunking** - Respects sentence boundaries for coherent summaries
-- **Quality optimization** - DistilBART maintains 95% of BART-Large quality
-- **Multi-page support** - Handles documents from 1-1000+ pages
-### 📊 **Rich Analytics**
-- **Document statistics** - Word count, page count, character analysis
-- **Compression ratios** - See how much your document was condensed
-- **Processing insights** - Real-time chunk processing updates
-- **Quality metrics** - Summary length and efficiency stats
-### 🎨 **Beautiful Interface**
-- **Modern design** - Clean, professional Gradio interface
-- **Real-time feedback** - Live status updates and progress tracking
-- **Mobile responsive** - Works perfectly on all devices
-- **Intuitive UX** - Drag-and-drop PDF upload with instant processing
-## 📈 **Performance Benchmarks**
-| Document Size | Processing Time | Memory Usage | Quality Score |
-|---------------|----------------|--------------|---------------|
-| 1-5 pages     | 3-8 seconds    | ~200MB       | 95%           |
-| 5-20 pages    | 8-15 seconds   | ~400MB       | 94%           |
-| 20-50 pages   | 15-30 seconds  | ~600MB       | 93%           |
-| 50+ pages     | 30-60 seconds  | ~800MB       | 92%           |
-## 🛠️ **Technical Architecture**
-### **Core Components**
-- **Model**: `sshleifer/distilbart-cnn-12-6` (DistilBART)
-- **Framework**: Hugging Face Transformers + PyTorch
-- **Interface**: Gradio 4.44+ with custom CSS styling
-- **PDF Processing**: PyPDF2 with intelligent text extraction
-### **Optimization Techniques**
-- **Smart Chunking**: 512-word chunks with sentence boundary respect
-- **Beam Search**: Reduced to 2 beams for faster inference
-- **Early Stopping**: Prevents unnecessary computation
-- **Float16 Precision**: GPU optimization when available
-- **Limited Processing**: Max 5 chunks to prevent timeouts
-### **Quality Assurance**
-- **Error Handling**: Robust exception management
-- **Fallback Systems**: Automatic model fallback if loading fails
-- **Input Validation**: PDF format and content verification
-- **Memory Management**: Efficient chunk processing and cleanup
-## 🎯 **Use Cases**
-### **Academic & Research**
-- Research paper summarization
-- Literature review assistance
-- Thesis and dissertation analysis
-- Conference paper quick reviews
-### **Business & Professional**
-- Report summarization
-- Contract key points extraction
-- Meeting minutes condensation
-- Policy document analysis
-### **Educational**
-- Textbook chapter summaries
-- Study guide creation
-- Course material review
-- Assignment research
-### **Personal**
-- Book summarization
-- Article condensation
-- Document organization
-- Information extraction
-## 🚀 **Quick Start**
-### **Option 1: Use Online (Recommended)**
-1. Visit the [Hugging Face Space](https://huggingface.co/spaces/[your-username]/lightning-pdf-summarizer)
-2. Upload your PDF file
-3. Select summary length
-4. Get instant results!
-### **Option 2: Local Deployment**
-```bash
-# Clone the repository
-git clone https://github.com/[your-username]/lightning-pdf-summarizer.git
-cd lightning-pdf-summarizer
-# Install dependencies
-pip install -r requirements.txt
-# Run the application
-python app.py
-```
-### **Option 3: Docker Deployment**
-```bash
-# Build the container
-docker build -t pdf-summarizer .
-# Run the container
-docker run -p 7860:7860 pdf-summarizer
-```
-## 📋 **Requirements**
-### **System Requirements**
-- **Python**: 3.10+
-- **RAM**: 2GB minimum, 4GB recommended
-- **Storage**: 1GB for model downloads
-- **GPU**: Optional but recommended (CUDA compatible)
-### **Dependencies**
-```
-gradio>=4.44.0          # Modern web interface
-transformers>=4.30.0    # Hugging Face models
-torch>=2.0.0           # PyTorch backend
-PyPDF2>=3.0.0          # PDF processing
-accelerate>=0.20.0     # GPU optimization
-optimum>=1.12.0        # Performance optimization
-```
-## 💡 **Pro Tips for Best Results**
-### **Document Preparation**
-- ✅ **Use text-based PDFs** (not scanned images)
-- ✅ **Clean formatting** produces better summaries
-- ✅ **English content** works best (optimized for English)
-- ✅ **500-10,000 words** is the sweet spot
-### **Summary Optimization**
-- 🚀 **Brief Mode**: Perfect for quick overviews (20-60 words)
-- 📊 **Detailed Mode**: Balanced summaries (40-100 words)
-- 📚 **Comprehensive Mode**: In-depth analysis (60-150 words)
-### **Performance Tips**
-- ⚡ **Smaller files** process faster
-- 🖥️ **GPU acceleration** significantly improves speed
-- 📱 **Mobile-friendly** - works on phones and tablets
-- 🔄 **Batch processing** for multiple documents
-## 🛠️ **Advanced Configuration**
-### **Custom Model Integration**
-```python
-# Replace with your preferred model
-self.model_name = "your-custom-model"
-```
-### **Chunk Size Optimization**
-```python
-# Adjust for your use case
-max_chunk_length = 512  # Increase for longer context
-max_chunks = 5          # Increase for larger documents
-```
-### **Summary Length Tuning**
-```python
-# Customize summary lengths
-summary_lengths = {
-    "brief": (20, 60),
-    "detailed": (40, 100),
-    "comprehensive": (60, 150)
-}
-```
-## 🐛 **Troubleshooting**
-### **Common Issues**
-**❌ "No text extracted"**
-- Ensure PDF has selectable text (not just images)
-- Try OCR preprocessing for scanned documents
-**❌ "Processing too slow"**
-- Use Brief mode for faster results
-- Check if GPU acceleration is available
-- Consider smaller document sections
-**❌ "Memory errors"**
-- Reduce chunk size in configuration
-- Process smaller documents
-- Restart the application
-**❌ "Model loading fails"**
-- Check internet connection for model download
-- Verify sufficient disk space (1GB+)
-- Try the fallback model option
-## 🤝 **Contributing**
-We welcome contributions! Here's how you can help:
-### **Bug Reports**
-- Use GitHub Issues with detailed descriptions
-- Include error messages and system info
-- Provide sample PDFs when possible
-### **Feature Requests**
-- Suggest new summarization models
-- Propose UI/UX improvements
-- Request new output formats
-### **Code Contributions**
-- Fork the repository
-- Create feature branches
-- Submit pull requests with tests
-- Follow PEP 8 style guidelines
-## 📊 **Roadmap**
-### **Version 2.0** (Coming Soon)
-- [ ] Multi-language support (Spanish, French, German)
-- [ ] Batch processing for multiple PDFs
-- [ ] Custom summary templates
-- [ ] Export options (Word, Markdown, JSON)
-### **Version 2.1**
-- [ ] OCR integration for scanned PDFs
-- [ ] Advanced chunking strategies
-- [ ] Summary quality scoring
-- [ ] API endpoint for developers
-### **Version 3.0**
-- [ ] Question-answering interface
-- [ ] Document comparison features
-- [ ] Integration with cloud storage
-- [ ] Enterprise deployment options
-## 📄 **License**
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-## 🙏 **Acknowledgments**
-- **Hugging Face** - For the amazing Transformers library and model hosting
-- **Facebook AI** - For the original BART architecture
-- **Gradio Team** - For the fantastic web interface framework
-- **PyPDF2 Contributors** - For reliable PDF processing
-- **Open Source Community** - For continuous improvements and feedback
-## 📞 **Support**
-### **Get Help**
-- 📧 **Email**: [your-email@domain.com]
-- 💬 **Discord**: [Your Discord Server]
-- 🐛 **Issues**: [GitHub Issues](https://github.com/[your-username]/lightning-pdf-summarizer/issues)
-- 📖 **Documentation**: [Full Docs](https://github.com/[your-username]/lightning-pdf-summarizer/wiki)
-### **Community**
-- ⭐ **Star this repo** if you find it useful!
-- 🔄 **Share** with colleagues and friends
-- 🤝 **Contribute** to make it even better
-- 📢 **Follow** for updates and new features
----
-**Made with ❤️ by [Your Name]**
 *Transform your document reading experience with Lightning PDF Summarizer!*

+---
+title: AI PDF Summarizer
+emoji: 📄
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 5.32.0
+app_file: app.py
+pinned: false
+license: mit
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/6474405f90330355db146c76/uCiC_ILzv0UUhGHSOBVzJ.jpeg
+short_description: An intelligent PDF document summarizer.
+---
+# ⚡ Lightning PDF Summarizer
+**Ultra-fast AI-powered PDF summarization** with intelligent text processing and beautiful interface.
+![Python](https://img.shields.io/badge/python-v3.10+-blue.svg)
+![Gradio](https://img.shields.io/badge/gradio-v4.44+-green.svg)
+![Transformers](https://img.shields.io/badge/transformers-v4.30+-orange.svg)
+![License](https://img.shields.io/badge/license-MIT-blue.svg)
+## 🚀 Features
+### ⚡ **Lightning Fast Performance**
+- **Ultra-fast DistilBART model** - 6x smaller than BART-Large (400MB vs 1.6GB)
+- **Optimized processing** - Smart chunking with 5-15 second processing times
+- **GPU acceleration** - Automatic CUDA detection and optimization
+- **Memory efficient** - Processes large PDFs without memory issues
+### 🎯 **Smart Summarization**
+- **3 Summary Modes**: Brief (Quick), Detailed, Comprehensive
+- **Intelligent chunking** - Respects sentence boundaries for coherent summaries
+- **Quality optimization** - DistilBART maintains 95% of BART-Large quality
+- **Multi-page support** - Handles documents from 1-1000+ pages
+### 📊 **Rich Analytics**
+- **Document statistics** - Word count, page count, character analysis
+- **Compression ratios** - See how much your document was condensed
+- **Processing insights** - Real-time chunk processing updates
+- **Quality metrics** - Summary length and efficiency stats
+### 🎨 **Beautiful Interface**
+- **Modern design** - Clean, professional Gradio interface
+- **Real-time feedback** - Live status updates and progress tracking
+- **Mobile responsive** - Works perfectly on all devices
+- **Intuitive UX** - Drag-and-drop PDF upload with instant processing
+## 📈 **Performance Benchmarks**
+| Document Size | Processing Time | Memory Usage | Quality Score |
+|---------------|----------------|--------------|---------------|
+| 1-5 pages     | 3-8 seconds    | ~200MB       | 95%           |
+| 5-20 pages    | 8-15 seconds   | ~400MB       | 94%           |
+| 20-50 pages   | 15-30 seconds  | ~600MB       | 93%           |
+| 50+ pages     | 30-60 seconds  | ~800MB       | 92%           |
+## 🛠️ **Technical Architecture**
+### **Core Components**
+- **Model**: `sshleifer/distilbart-cnn-12-6` (DistilBART)
+- **Framework**: Hugging Face Transformers + PyTorch
+- **Interface**: Gradio 4.44+ with custom CSS styling
+- **PDF Processing**: PyPDF2 with intelligent text extraction
+### **Optimization Techniques**
+- **Smart Chunking**: 512-word chunks with sentence boundary respect
+- **Beam Search**: Reduced to 2 beams for faster inference
+- **Early Stopping**: Prevents unnecessary computation
+- **Float16 Precision**: GPU optimization when available
+- **Limited Processing**: Max 5 chunks to prevent timeouts
+### **Quality Assurance**
+- **Error Handling**: Robust exception management
+- **Fallback Systems**: Automatic model fallback if loading fails
+- **Input Validation**: PDF format and content verification
+- **Memory Management**: Efficient chunk processing and cleanup
+## 🎯 **Use Cases**
+### **Academic & Research**
+- Research paper summarization
+- Literature review assistance
+- Thesis and dissertation analysis
+- Conference paper quick reviews
+### **Business & Professional**
+- Report summarization
+- Contract key points extraction
+- Meeting minutes condensation
+- Policy document analysis
+### **Educational**
+- Textbook chapter summaries
+- Study guide creation
+- Course material review
+- Assignment research
+### **Personal**
+- Book summarization
+- Article condensation
+- Document organization
+- Information extraction
+## 🚀 **Quick Start**
+### **Option 1: Use Online (Recommended)**
+1. Visit the [Hugging Face Space](https://huggingface.co/spaces/[your-username]/lightning-pdf-summarizer)
+2. Upload your PDF file
+3. Select summary length
+4. Get instant results!
+### **Option 2: Local Deployment**
+```bash
+# Clone the repository
+git clone https://github.com/[your-username]/lightning-pdf-summarizer.git
+cd lightning-pdf-summarizer
+# Install dependencies
+pip install -r requirements.txt
+# Run the application
+python app.py
+```
+### **Option 3: Docker Deployment**
+```bash
+# Build the container
+docker build -t pdf-summarizer .
+# Run the container
+docker run -p 7860:7860 pdf-summarizer
+```
+## 📋 **Requirements**
+### **System Requirements**
+- **Python**: 3.10+
+- **RAM**: 2GB minimum, 4GB recommended
+- **Storage**: 1GB for model downloads
+- **GPU**: Optional but recommended (CUDA compatible)
+### **Dependencies**
+```
+gradio>=4.44.0          # Modern web interface
+transformers>=4.30.0    # Hugging Face models
+torch>=2.0.0           # PyTorch backend
+PyPDF2>=3.0.0          # PDF processing
+accelerate>=0.20.0     # GPU optimization
+optimum>=1.12.0        # Performance optimization
+```
+## 💡 **Pro Tips for Best Results**
+### **Document Preparation**
+- ✅ **Use text-based PDFs** (not scanned images)
+- ✅ **Clean formatting** produces better summaries
+- ✅ **English content** works best (optimized for English)
+- ✅ **500-10,000 words** is the sweet spot
+### **Summary Optimization**
+- 🚀 **Brief Mode**: Perfect for quick overviews (20-60 words)
+- 📊 **Detailed Mode**: Balanced summaries (40-100 words)
+- 📚 **Comprehensive Mode**: In-depth analysis (60-150 words)
+### **Performance Tips**
+- ⚡ **Smaller files** process faster
+- 🖥️ **GPU acceleration** significantly improves speed
+- 📱 **Mobile-friendly** - works on phones and tablets
+- 🔄 **Batch processing** for multiple documents
+## 🛠️ **Advanced Configuration**
+### **Custom Model Integration**
+```python
+# Replace with your preferred model
+self.model_name = "your-custom-model"
+```
+### **Chunk Size Optimization**
+```python
+# Adjust for your use case
+max_chunk_length = 512  # Increase for longer context
+max_chunks = 5          # Increase for larger documents
+```
+### **Summary Length Tuning**
+```python
+# Customize summary lengths
+summary_lengths = {
+    "brief": (20, 60),
+    "detailed": (40, 100),
+    "comprehensive": (60, 150)
+}
+```
+## 🐛 **Troubleshooting**
+### **Common Issues**
+**❌ "No text extracted"**
+- Ensure PDF has selectable text (not just images)
+- Try OCR preprocessing for scanned documents
+**❌ "Processing too slow"**
+- Use Brief mode for faster results
+- Check if GPU acceleration is available
+- Consider smaller document sections
+**❌ "Memory errors"**
+- Reduce chunk size in configuration
+- Process smaller documents
+- Restart the application
+**❌ "Model loading fails"**
+- Check internet connection for model download
+- Verify sufficient disk space (1GB+)
+- Try the fallback model option
+## 🤝 **Contributing**
+We welcome contributions! Here's how you can help:
+### **Bug Reports**
+- Use GitHub Issues with detailed descriptions
+- Include error messages and system info
+- Provide sample PDFs when possible
+### **Feature Requests**
+- Suggest new summarization models
+- Propose UI/UX improvements
+- Request new output formats
+### **Code Contributions**
+- Fork the repository
+- Create feature branches
+- Submit pull requests with tests
+- Follow PEP 8 style guidelines
+## 📊 **Roadmap**
+### **Version 2.0** (Coming Soon)
+- [ ] Multi-language support (Spanish, French, German)
+- [ ] Batch processing for multiple PDFs
+- [ ] Custom summary templates
+- [ ] Export options (Word, Markdown, JSON)
+### **Version 2.1**
+- [ ] OCR integration for scanned PDFs
+- [ ] Advanced chunking strategies
+- [ ] Summary quality scoring
+- [ ] API endpoint for developers
+### **Version 3.0**
+- [ ] Question-answering interface
+- [ ] Document comparison features
+- [ ] Integration with cloud storage
+- [ ] Enterprise deployment options
+## 📄 **License**
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## 🙏 **Acknowledgments**
+- **Hugging Face** - For the amazing Transformers library and model hosting
+- **Facebook AI** - For the original BART architecture
+- **Gradio Team** - For the fantastic web interface framework
+- **PyPDF2 Contributors** - For reliable PDF processing
+- **Open Source Community** - For continuous improvements and feedback
+## 📞 **Support**
+### **Get Help**
+- 📧 **Email**: [your-email@domain.com]
+- 💬 **Discord**: [Your Discord Server]
+- 🐛 **Issues**: [GitHub Issues](https://github.com/[your-username]/lightning-pdf-summarizer/issues)
+- 📖 **Documentation**: [Full Docs](https://github.com/[your-username]/lightning-pdf-summarizer/wiki)
+### **Community**
+- ⭐ **Star this repo** if you find it useful!
+- 🔄 **Share** with colleagues and friends
+- 🤝 **Contribute** to make it even better
+- 📢 **Follow** for updates and new features
+---
+**Made with ❤️ by [Your Name]**
 *Transform your document reading experience with Lightning PDF Summarizer!*
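
The README's "Smart Chunking" bullet and its `max_chunk_length` / `max_chunks` settings describe splitting extracted text into 512-word chunks that respect sentence boundaries, with at most 5 chunks processed. This diff does not include `app.py`, so the following is only a minimal sketch of that idea under stated assumptions: the function name `chunk_text`, the regex-based sentence splitter, and the way over-long sentences are handled are hypothetical and may differ from what the Space actually does.

```python
import re


def chunk_text(text, max_chunk_length=512, max_chunks=5):
    """Group sentences into chunks of at most `max_chunk_length` words.

    Hypothetical helper, not taken from app.py. Chunks only break between
    sentences, and at most `max_chunks` chunks are returned; any text
    beyond that limit is dropped. A single sentence longer than
    `max_chunk_length` is kept whole as its own oversized chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks = []
    current = []      # sentences collected for the chunk being built
    word_count = 0    # number of words in `current`

    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            continue
        # Close the current chunk if this sentence would exceed the limit.
        if current and word_count + n_words > max_chunk_length:
            chunks.append(" ".join(current))
            if len(chunks) == max_chunks:
                return chunks
            current, word_count = [], 0
        current.append(sentence)
        word_count += n_words

    # Flush the final partial chunk if there is room for it.
    if current and len(chunks) < max_chunks:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for i, chunk in enumerate(chunk_text(sample, max_chunk_length=6), start=1):
        print(i, chunk)
```

Raising `max_chunks` in a sketch like this covers more of a long document at the cost of processing time, which matches the trade-off the README's "Limited Processing" bullet describes.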