title: NovaEval by Noveum.ai
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false

NovaEval by Noveum.ai

Advanced AI Model Evaluation Platform powered by Hugging Face Models

🚀 Features

🤖 Comprehensive Model Selection

15+ Top Hugging Face Models across different size categories
Real-time Model Search with provider filtering
Detailed Model Information including capabilities, size, and provider
Size-based Filtering (Small 1-3B, Medium 7B, Large 14B+)

📊 Rich Dataset Collection

11 Evaluation Datasets covering reasoning, knowledge, math, code, and language
Category-based Filtering for easy dataset discovery
Detailed Dataset Information including sample counts and difficulty levels
Popular Benchmarks like MMLU, HellaSwag, GSM8K, HumanEval

⚡ Advanced Evaluation Engine

Real-time Progress Tracking with WebSocket updates
Live Evaluation Logs showing detailed request/response data
Multiple Metrics Support (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
Configurable Parameters (sample size, temperature, max tokens)
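
As an illustration of the progress tracking described above, here is a minimal sketch of how per-sample progress updates could be serialized before being sent over a WebSocket. The function name and the message fields (type, completed, total, percent) are assumptions made for this example; the README does not document the actual message format.

```python
import json


def progress_messages(model, dataset, total_samples):
    """Yield one JSON-encoded progress message per evaluated sample.

    Hypothetical message shape; the real application may use different fields.
    """
    if total_samples < 1:
        raise ValueError("total_samples must be at least 1")
    for done in range(1, total_samples + 1):
        yield json.dumps({
            "type": "progress",
            "model": model,
            "dataset": dataset,
            "completed": done,
            "total": total_samples,
            # Percentage rounded to one decimal place for display in a progress bar.
            "percent": round(100 * done / total_samples, 1),
        })
```

A server could iterate over this generator and forward each string to connected clients, which would then update the progress bar and log view.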

🎨 Modern User Interface

Responsive Design optimized for desktop and mobile
Interactive Model Cards with hover effects and selection states
Real-time Configuration with sliders and checkboxes
Professional Gradient Design with smooth animations

🔧 Technical Stack

Backend: FastAPI + Python 3.11
Frontend: HTML5 + Tailwind CSS + Vanilla JavaScript
Real-time: WebSocket for live updates
Models: Hugging Face Inference API (free tier)
Deployment: Docker + Hugging Face Spaces

📋 Available Models

Small Models (1-3B)

FLAN-T5 Large (0.8B) - Google
Qwen 2.5 3B (3B) - Alibaba
Gemma 2B (2B) - Google

Medium Models (7B)

Qwen 2.5 7B (7B) - Alibaba
Mistral 7B (7B) - Mistral AI
DialoGPT Medium (345M) - Microsoft
CodeLlama 7B Python (7B) - Meta

Large Models (14B+)

Qwen 2.5 14B (14B) - Alibaba
Qwen 2.5 32B (32B) - Alibaba
Qwen 2.5 72B (72B) - Alibaba
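
The search and size filtering over this list might look roughly like the sketch below. The catalogue entries are copied from the list above; the ModelInfo class, its field names, and the search_models function are hypothetical and not taken from the application code. Categories follow the grouping above rather than strict parameter thresholds, since FLAN-T5 Large (0.8B) and DialoGPT Medium (345M) fall outside the nominal size ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    params: str    # parameter count as printed in the list above
    provider: str
    category: str  # "small", "medium", or "large", following the list above


# Catalogue copied from the list above; the structure itself is an assumption.
MODELS = [
    ModelInfo("FLAN-T5 Large", "0.8B", "Google", "small"),
    ModelInfo("Qwen 2.5 3B", "3B", "Alibaba", "small"),
    ModelInfo("Gemma 2B", "2B", "Google", "small"),
    ModelInfo("Qwen 2.5 7B", "7B", "Alibaba", "medium"),
    ModelInfo("Mistral 7B", "7B", "Mistral AI", "medium"),
    ModelInfo("DialoGPT Medium", "345M", "Microsoft", "medium"),
    ModelInfo("CodeLlama 7B Python", "7B", "Meta", "medium"),
    ModelInfo("Qwen 2.5 14B", "14B", "Alibaba", "large"),
    ModelInfo("Qwen 2.5 32B", "32B", "Alibaba", "large"),
    ModelInfo("Qwen 2.5 72B", "72B", "Alibaba", "large"),
]


def search_models(query="", category=None, provider=None):
    """Case-insensitive substring match on name or provider, with optional filters."""
    q = query.strip().lower()
    results = []
    for m in MODELS:
        if category is not None and m.category != category:
            continue
        if provider is not None and m.provider.lower() != provider.lower():
            continue
        if q and q not in m.name.lower() and q not in m.provider.lower():
            continue
        results.append(m)
    return results
```

For example, search_models("qwen", category="large") returns the three largest Qwen 2.5 entries, and search_models(provider="google") returns FLAN-T5 Large and Gemma 2B.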

📊 Available Datasets

Reasoning

HellaSwag - Commonsense reasoning (60K samples)
CommonsenseQA - Reasoning questions (12.1K samples)
ARC - Science reasoning (7.8K samples)

Knowledge

MMLU - Multitask understanding (231K samples)
BoolQ - Reading comprehension (12.7K samples)

Math

GSM8K - Grade school math (17.6K samples)
AQUA-RAT - Algebraic reasoning (196K samples)

Code

HumanEval - Python code generation (164 samples)
MBPP - Basic Python problems (1.4K samples)

Language

IMDB Reviews - Sentiment analysis (100K samples)
CNN/DailyMail - Summarization (936K samples)

🎯 Evaluation Metrics

Accuracy - Percentage of correct predictions
F1 Score - Harmonic mean of precision and recall
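BLEU Score - N-gram overlap between generated text and reference text
ROUGE Score - Overlap between a generated summary and a reference summary
Pass@K - Share of coding problems solved by at least one of K generated samples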
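
To make three of these metrics concrete, here is a minimal sketch in plain Python. It is not the application's implementation: the function names are hypothetical, F1 is shown for the binary case only, and Pass@K uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The README does not say which Pass@K estimator NovaEval uses.

```python
from math import comb


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_score(predictions, references, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive label."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # no true positives, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every subset of k samples contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem pass_at_k values would then be averaged over all problems in a dataset such as HumanEval or MBPP.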

🚀 Quick Start

Option 1: Direct Upload to Hugging Face Spaces

1. Create a new Space on Hugging Face.
2. Choose "Docker" as the SDK.
3. Upload these files:
- app.py (renamed from advanced_novaeval_app.py)
- requirements.txt
- Dockerfile
- README.md
4. Commit and push. Your Space will build automatically.

Option 2: Local Development

# Install dependencies
pip install -r requirements.txt

# Run the application
python advanced_novaeval_app.py

# Open browser to http://localhost:7860

🔧 Configuration Options

Model Parameters

Sample Size: 10-1000 samples
Temperature: 0.0-2.0 (sampling randomness; higher values give more varied output)
Max Tokens: 128-2048 (response length)
Top-p: 0.9 (nucleus sampling)
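
The ranges above could be enforced with a small validated configuration object, sketched below. The EvalConfig class and its default values for sample size, temperature, and max tokens are assumptions for illustration; only the ranges and the Top-p value of 0.9 come from this README.

```python
from dataclasses import dataclass


@dataclass
class EvalConfig:
    # Defaults are hypothetical, except top_p, which matches the value listed above.
    sample_size: int = 100
    temperature: float = 0.7
    max_tokens: int = 512
    top_p: float = 0.9

    def __post_init__(self):
        # Ranges are the ones listed under Model Parameters.
        if not 10 <= self.sample_size <= 1000:
            raise ValueError("sample_size must be between 10 and 1000")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        if not 128 <= self.max_tokens <= 2048:
            raise ValueError("max_tokens must be between 128 and 2048")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in the interval (0, 1]")
```

Constructing EvalConfig(sample_size=5) raises a ValueError, so out-of-range values are rejected before any evaluation starts.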

Evaluation Settings

Multiple Model Selection: Compare up to 10 models
Flexible Metrics: Choose relevant metrics for your task
Real-time Monitoring: Watch evaluations progress live
Export Results: Download results in JSON format
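
A JSON export that respects the 10-model limit might be assembled as in the sketch below. The report layout (a dataset name plus a list of per-model metric dictionaries) and the function names are hypothetical; the README does not specify the schema of the exported file.

```python
import json

MAX_MODELS = 10  # comparison limit quoted in the settings above


def build_report(dataset, scores_by_model):
    """Build an export dictionary from {model name: {metric name: value}}."""
    if not 1 <= len(scores_by_model) <= MAX_MODELS:
        raise ValueError(f"select between 1 and {MAX_MODELS} models")
    return {
        "dataset": dataset,
        "results": [
            {"model": name, "metrics": dict(metrics)}
            for name, metrics in scores_by_model.items()
        ],
    }


def export_json(report):
    """Serialize the report to an indented JSON string suitable for download."""
    return json.dumps(report, indent=2, sort_keys=True)
```

Keeping the report as a plain dictionary until the final serialization step makes it easy to reuse the same data for the side-by-side comparison view.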

📱 User Experience

Workflow

1. Select Models - Choose from 15+ Hugging Face models
2. Pick Dataset - Select from 11 evaluation datasets
3. Configure Metrics - Choose relevant evaluation metrics
4. Set Parameters - Adjust sample size, temperature, etc.
5. Start Evaluation - Watch real-time progress and logs
6. View Results - Analyze performance comparisons

Features

Model Search - Find models by name or provider
Category Filtering - Filter by model size or dataset type
Real-time Logs - See actual evaluation steps
Progress Tracking - Visual progress bars and percentages
Interactive Results - Compare models side-by-side

🌟 Why NovaEval?

For Researchers

Comprehensive Benchmarking across multiple models and datasets
Standardized Evaluation with consistent metrics and procedures
Real-time Monitoring to track evaluation progress
Export Capabilities for further analysis

For Developers

Easy Integration with Hugging Face ecosystem
No API Keys Required - uses free HF Inference API
Modern Interface with responsive design
Detailed Logging for debugging and analysis

For Teams

Collaborative Evaluation with shareable results
Professional Interface suitable for presentations
Comprehensive Documentation for easy onboarding
Open Source with full customization capabilities

🔗 Links

Noveum.ai: https://noveum.ai
NovaEval Framework: https://github.com/Noveum/NovaEval
Hugging Face Models: https://huggingface.co/models
Documentation: Available in the application interface

📄 License

This project is open source and available under the MIT License.

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for more information.

Built with ❤️ by Noveum.ai - Advancing AI Evaluation