metadata
title: NovaEval by Noveum.ai
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
NovaEval by Noveum.ai
Advanced AI Model Evaluation Platform powered by Hugging Face Models
🚀 Features
🤖 Comprehensive Model Selection
- 15+ Top Hugging Face Models across different size categories
- Real-time Model Search with provider filtering
- Detailed Model Information including capabilities, size, and provider
- Size-based Filtering (Small 1-3B, Medium 7B, Large 14B+)
📊 Rich Dataset Collection
- 11 Evaluation Datasets covering reasoning, knowledge, math, code, and language
- Category-based Filtering for easy dataset discovery
- Detailed Dataset Information including sample counts and difficulty levels
- Popular Benchmarks like MMLU, HellaSwag, GSM8K, HumanEval
⚡ Advanced Evaluation Engine
- Real-time Progress Tracking with WebSocket updates
- Live Evaluation Logs showing detailed request/response data
- Multiple Metrics Support (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- Configurable Parameters (sample size, temperature, max tokens)
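As an illustration of the progress tracking described above, here is a minimal sketch of one progress message that a server could push over a WebSocket. The `ProgressUpdate` class and its field names are hypothetical and are not taken from NovaEval's code; only the idea of reporting completed samples against a total comes from the feature list.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProgressUpdate:
    """Hypothetical shape of one progress message (not NovaEval's actual schema)."""
    model: str
    dataset: str
    completed: int
    total: int

    @property
    def percent(self) -> float:
        # Guard against an empty run so we never divide by zero.
        if self.total == 0:
            return 0.0
        return round(100.0 * self.completed / self.total, 1)

    def to_json(self) -> str:
        # asdict() only covers the declared fields, so add the derived value.
        payload = asdict(self)
        payload["percent"] = self.percent
        return json.dumps(payload)


update = ProgressUpdate(model="Qwen 2.5 7B", dataset="GSM8K", completed=25, total=100)
message = update.to_json()  # JSON string ready to send to the browser
```

Sending the derived percentage alongside the raw counts lets the frontend drive a progress bar without repeating the arithmetic.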
🎨 Modern User Interface
- Responsive Design optimized for desktop and mobile
- Interactive Model Cards with hover effects and selection states
- Real-time Configuration with sliders and checkboxes
- Professional Gradient Design with smooth animations
🔧 Technical Stack
- Backend: FastAPI + Python 3.11
- Frontend: HTML5 + Tailwind CSS + Vanilla JavaScript
- Real-time: WebSocket for live updates
- Models: Hugging Face Inference API (free tier)
- Deployment: Docker + Hugging Face Spaces
📋 Available Models
Small Models (1-3B)
- FLAN-T5 Large (0.8B) - Google
- Qwen 2.5 3B (3B) - Alibaba
- Gemma 2B (2B) - Google
Medium Models (7B)
- Qwen 2.5 7B (7B) - Alibaba
- Mistral 7B (7B) - Mistral AI
- DialoGPT Medium (345M) - Microsoft
- CodeLlama 7B Python (7B) - Meta
Large Models (14B+)
- Qwen 2.5 14B (14B) - Alibaba
- Qwen 2.5 32B (32B) - Alibaba
- Qwen 2.5 72B (72B) - Alibaba
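The search and size filtering over this list could be sketched as follows. The `ModelInfo` class, the `CATALOGUE` list, and `search_models` are hypothetical names used for illustration. The `size` field follows the grouping in the list above rather than being computed from the parameter count, because some entries (for example DialoGPT Medium at 345M) sit outside the nominal range of their group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str
    params: str
    provider: str
    size: str  # "small", "medium", or "large", as grouped in the list above


# Entries copied from the list above; the structure itself is an assumption.
CATALOGUE = [
    ModelInfo("FLAN-T5 Large", "0.8B", "Google", "small"),
    ModelInfo("Qwen 2.5 3B", "3B", "Alibaba", "small"),
    ModelInfo("Gemma 2B", "2B", "Google", "small"),
    ModelInfo("Qwen 2.5 7B", "7B", "Alibaba", "medium"),
    ModelInfo("Mistral 7B", "7B", "Mistral AI", "medium"),
    ModelInfo("DialoGPT Medium", "345M", "Microsoft", "medium"),
    ModelInfo("CodeLlama 7B Python", "7B", "Meta", "medium"),
    ModelInfo("Qwen 2.5 14B", "14B", "Alibaba", "large"),
    ModelInfo("Qwen 2.5 32B", "32B", "Alibaba", "large"),
    ModelInfo("Qwen 2.5 72B", "72B", "Alibaba", "large"),
]


def search_models(catalogue: List[ModelInfo], query: str = "",
                  size: Optional[str] = None) -> List[ModelInfo]:
    """Return models whose name or provider contains `query`, optionally
    restricted to one size group. An empty query matches every model."""
    q = query.strip().lower()
    return [
        m for m in catalogue
        if (size is None or m.size == size)
        and (q in m.name.lower() or q in m.provider.lower())
    ]


large_qwen = search_models(CATALOGUE, "qwen", size="large")
```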
📊 Available Datasets
Reasoning
- HellaSwag - Commonsense reasoning (60K samples)
- CommonsenseQA - Reasoning questions (12.1K samples)
- ARC - Science reasoning (7.8K samples)
Knowledge
- MMLU - Multitask understanding (231K samples)
- BoolQ - Reading comprehension (12.7K samples)
Math
- GSM8K - Grade school math (17.6K samples)
- AQUA-RAT - Algebraic reasoning (196K samples)
Code
- HumanEval - Python code generation (164 samples)
- MBPP - Basic Python problems (1.4K samples)
Language
- IMDB Reviews - Sentiment analysis (100K samples)
- CNN/DailyMail - Summarization (936K samples)
🎯 Evaluation Metrics
- Accuracy - Percentage of correct predictions
- F1 Score - Harmonic mean of precision and recall
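- BLEU Score - N-gram overlap between generated text and a reference text
- ROUGE Score - Overlap between a generated summary and a reference summary
- Pass@K - Share of coding problems solved by at least one of K generated samples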
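To make three of these metrics concrete, here is a minimal sketch in plain Python. The function names are illustrative and this is not necessarily how NovaEval computes its scores. `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples were generated for a problem and c of them passed the tests.

```python
from math import comb


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_score(predictions, references, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given n generated samples of which c passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a single problem with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is the plain success rate of 0.3 (up to floating-point rounding); the benchmark-level score is the mean of this value over all problems.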
🚀 Quick Start
Option 1: Direct Upload to Hugging Face Spaces
- Create a new Space on Hugging Face
- Choose "Docker" as the SDK
- Upload these files:
  - app.py (renamed from advanced_novaeval_app.py)
  - requirements.txt
  - Dockerfile
  - README.md
- Commit and push - your Space will build automatically!
Option 2: Local Development
# Install dependencies
pip install -r requirements.txt
# Run the application
python advanced_novaeval_app.py
# Open browser to http://localhost:7860
🔧 Configuration Options
Model Parameters
- Sample Size: 10-1000 samples
- Temperature: 0.0-2.0 (sampling randomness; higher values give more varied output)
- Max Tokens: 128-2048 (response length)
- Top-p: 0.9 (nucleus sampling)
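The ranges above could be enforced with a small validated settings object, sketched below. `EvalConfig` is a hypothetical name, and the default values for sample size, temperature, and max tokens are assumptions chosen for illustration; only the allowed ranges and the top-p value of 0.9 come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the model parameters listed above."""
    sample_size: int = 100     # assumed default; allowed range 10-1000
    temperature: float = 0.7   # assumed default; allowed range 0.0-2.0
    max_tokens: int = 512      # assumed default; allowed range 128-2048
    top_p: float = 0.9         # value stated in the list above

    def __post_init__(self):
        # Reject out-of-range values early, before any model is called.
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in the interval (0, 1]")


config = EvalConfig(sample_size=50, temperature=0.0, max_tokens=128)
```

Validating in `__post_init__` means an invalid combination can never exist as an object, so later code does not need to re-check the ranges.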
Evaluation Settings
- Multiple Model Selection: Compare up to 10 models
- Flexible Metrics: Choose relevant metrics for your task
- Real-time Monitoring: Watch evaluations progress live
- Export Results: Download results in JSON format
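A JSON export of a multi-model comparison might look like the sketch below. The function name, the layout of the exported document, and the `MAX_MODELS` constant are assumptions made for illustration; the limit of 10 models and the JSON format come from the settings above. The scores in the usage lines are placeholders, not measured results.

```python
import json
from datetime import datetime, timezone

MAX_MODELS = 10  # comparison limit stated in the settings above


def results_to_json(dataset, results):
    """Serialize {model_name: {metric_name: score}} to a JSON string.

    The document layout here is hypothetical, not NovaEval's export format.
    """
    if len(results) > MAX_MODELS:
        raise ValueError("at most %d models can be compared" % MAX_MODELS)
    document = {
        "dataset": dataset,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    # sort_keys gives a stable layout that is easy to diff between runs.
    return json.dumps(document, indent=2, sort_keys=True)


# Placeholder scores for two made-up model names.
exported = results_to_json(
    "GSM8K",
    {"model-a": {"accuracy": 0.5}, "model-b": {"accuracy": 0.25}},
)
```

The returned string can be written to a file or offered as a browser download.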
📱 User Experience
Workflow
- Select Models - Choose from 15+ Hugging Face models
- Pick Dataset - Select from 11 evaluation datasets
- Configure Metrics - Choose relevant evaluation metrics
- Set Parameters - Adjust sample size, temperature, etc.
- Start Evaluation - Watch real-time progress and logs
- View Results - Analyze performance comparisons
Features
- Model Search - Find models by name or provider
- Category Filtering - Filter by model size or dataset type
- Real-time Logs - See actual evaluation steps
- Progress Tracking - Visual progress bars and percentages
- Interactive Results - Compare models side-by-side
🌟 Why NovaEval?
For Researchers
- Comprehensive Benchmarking across multiple models and datasets
- Standardized Evaluation with consistent metrics and procedures
- Real-time Monitoring to track evaluation progress
- Export Capabilities for further analysis
For Developers
- Easy Integration with Hugging Face ecosystem
- No API Keys Required - uses free HF Inference API
- Modern Interface with responsive design
- Detailed Logging for debugging and analysis
For Teams
- Collaborative Evaluation with shareable results
- Professional Interface suitable for presentations
- Comprehensive Documentation for easy onboarding
- Open Source with full customization capabilities
🔗 Links
- Noveum.ai: https://noveum.ai
- NovaEval Framework: https://github.com/Noveum/NovaEval
- Hugging Face Models: https://huggingface.co/models
- Documentation: Available in the application interface
📄 License
This project is open source and available under the MIT License.
🤝 Contributing
We welcome contributions! Please see our contributing guidelines for more information.
Built with ❤️ by Noveum.ai - Advancing AI Evaluation