---
title: NovaEval by Noveum.ai
emoji:
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# NovaEval by Noveum.ai
Advanced AI Model Evaluation Platform powered by Hugging Face Models
## 🚀 Features
### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small ≤3B, Medium ~7B, Large 14B+)
### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval
### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)
### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations
## 🔧 **Technical Stack**
- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces
## 📋 **Available Models**
### Small Models (≤3B)
- **DialoGPT Medium** (345M) - Microsoft
- **FLAN-T5 Large** (0.8B) - Google
- **Gemma 2B** (2B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **CodeLlama 7B Python** (7B) - Meta
### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba
## 📊 **Available Datasets**
### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)
### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)
### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)
### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)
### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)
## 🎯 **Evaluation Metrics**
- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
- **BLEU Score** - Text generation quality
- **ROUGE Score** - Summarization quality
- **Pass@K** - Code generation success rate
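Three of these metrics can be computed in plain Python. The sketch below is illustrative (function names are not NovaEval's API); `pass_at_k` uses the standard unbiased estimator, 1 − C(n−c, k)/C(n, k), for n sampled completions of which c pass.

```python
from math import comb

def accuracy(preds: list, labels: list) -> float:
    """Fraction of predictions that exactly match the labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1(preds: list, labels: list, positive=1) -> float:
    """Binary F1: harmonic mean of precision and recall for one positive class."""
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn from n samples with c correct, passes the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

BLEU and ROUGE involve n-gram matching against references and are best taken from an established library rather than reimplemented.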
## 🚀 **Quick Start**
### Option 1: Direct Upload to Hugging Face Spaces
1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
- `app.py` (renamed from `advanced_novaeval_app.py`)
- `requirements.txt`
- `Dockerfile`
- `README.md`
4. Commit and push - your Space will build automatically!
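If you need a starting point for the `Dockerfile` in step 3, a minimal sketch might look like the following, assuming the entrypoint is `app.py` (as renamed above) and the app listens on port 7860, the port Hugging Face Spaces expects:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Hugging Face Spaces routes traffic to port 7860
EXPOSE 7860
CMD ["python", "app.py"]
```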
### Option 2: Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run the application
python advanced_novaeval_app.py
# Open browser to http://localhost:7860
```
## 🔧 **Configuration Options**
### Model Parameters
- **Sample Size**: 10-1000 samples
- **Temperature**: 0.0-2.0 (controls sampling randomness)
- **Max Tokens**: 128-2048 (response length)
- **Top-p**: 0.9 (nucleus sampling)
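The parameters above map onto the Hugging Face Inference API's text-generation options (`temperature`, `top_p`, `max_new_tokens`). A hedged sketch of building and validating such a payload, with the ranges listed above (the function itself is illustrative, not NovaEval's API):

```python
def build_parameters(temperature: float = 0.7, max_tokens: int = 512, top_p: float = 0.9) -> dict:
    """Build a generation-parameter payload, enforcing the README's ranges."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 128 <= max_tokens <= 2048:
        raise ValueError("max_tokens must be in [128, 2048]")
    return {
        "temperature": temperature,
        "max_new_tokens": max_tokens,  # HF Inference API field name
        "top_p": top_p,
        "return_full_text": False,  # return only the completion, not the prompt
    }
```

The resulting dict would be sent as the `parameters` field of an Inference API request body alongside the prompt in `inputs`.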
### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format
## 📱 **User Experience**
### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons
### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side
## 🌟 **Why NovaEval?**
### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis
### For Developers
- **Easy Integration** with Hugging Face ecosystem
- **No API Keys Required** - uses free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis
### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities
## 🔗 **Links**
- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface
## 📄 **License**
This project is open source and available under the MIT License.
## 🤝 **Contributing**
We welcome contributions! Please see our contributing guidelines for more information.
---
**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**