---
title: NovaEval by Noveum.ai
emoji:
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

NovaEval by Noveum.ai

Advanced AI Model Evaluation Platform powered by Hugging Face Models

🚀 Features

🤖 Comprehensive Model Selection

  • 15+ Top Hugging Face Models across different size categories
  • Real-time Model Search with provider filtering
  • Detailed Model Information including capabilities, size, and provider
  • Size-based Filtering (Small 1-3B, Medium 7B, Large 14B+)
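
Internally, this kind of selection UI implies a model registry keyed by size and provider. A minimal sketch of what that might look like (the entries and field names below are illustrative assumptions, not NovaEval's actual schema):

```python
# Hypothetical registry entries; IDs and field names are illustrative only.
MODELS = [
    {"id": "google/flan-t5-large", "provider": "Google", "params_b": 0.8},
    {"id": "Qwen/Qwen2.5-7B-Instruct", "provider": "Alibaba", "params_b": 7.0},
    {"id": "Qwen/Qwen2.5-72B-Instruct", "provider": "Alibaba", "params_b": 72.0},
]

def size_category(params_b: float) -> str:
    """Map a parameter count (in billions) to the UI's size buckets."""
    if params_b <= 3:
        return "small"
    if params_b < 14:
        return "medium"
    return "large"

small_models = [m for m in MODELS if size_category(m["params_b"]) == "small"]
```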

📊 Rich Dataset Collection

  • 11 Evaluation Datasets covering reasoning, knowledge, math, code, and language
  • Category-based Filtering for easy dataset discovery
  • Detailed Dataset Information including sample counts and difficulty levels
  • Popular Benchmarks like MMLU, HellaSwag, GSM8K, HumanEval
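
All of these benchmarks live on the Hugging Face Hub, so they can be pulled with the `datasets` library. A sketch of how that might look (the exact dataset IDs, configs, and splits NovaEval uses are assumptions):

```python
from datasets import load_dataset

# Illustrative loads; NovaEval's actual dataset IDs, configs, and splits may differ.
gsm8k = load_dataset("gsm8k", "main", split="test")          # grade-school math
hellaswag = load_dataset("hellaswag", split="validation")    # commonsense reasoning
humaneval = load_dataset("openai_humaneval", split="test")   # code generation

print(gsm8k[0]["question"])
```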

Advanced Evaluation Engine

  • Real-time Progress Tracking with WebSocket updates
  • Live Evaluation Logs showing detailed request/response data
  • Multiple Metrics Support (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
  • Configurable Parameters (sample size, temperature, max tokens)
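
Conceptually, the engine boils down to a loop that scores one sample at a time and pushes a progress event after each step. A sketch of that loop, with a hypothetical event schema and model wrapper:

```python
async def run_evaluation(model, samples, send_event):
    """Score samples one by one, emitting a progress event after each step.
    `model` and `send_event` are hypothetical interfaces, not NovaEval's real ones."""
    correct = 0
    for i, sample in enumerate(samples, start=1):
        prediction = await model.generate(sample["input"])  # assumed async wrapper
        correct += int(prediction.strip() == sample["target"])
        await send_event({                 # relayed to the browser over WebSocket
            "type": "progress",
            "completed": i,
            "total": len(samples),
            "running_accuracy": correct / i,
            "log": f"sample {i}: {prediction[:80]!r}",
        })
    return {"accuracy": correct / len(samples)}
```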

🎨 Modern User Interface

  • Responsive Design optimized for desktop and mobile
  • Interactive Model Cards with hover effects and selection states
  • Real-time Configuration with sliders and checkboxes
  • Professional Gradient Design with smooth animations

🔧 Technical Stack

  • Backend: FastAPI + Python 3.11
  • Frontend: HTML5 + Tailwind CSS + Vanilla JavaScript
  • Real-time: WebSocket for live updates
  • Models: Hugging Face Inference API (free tier)
  • Deployment: Docker + Hugging Face Spaces
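
For illustration, the FastAPI + WebSocket pairing in this stack typically looks like the sketch below; the endpoint path and message payload are assumptions, not NovaEval's actual API:

```python
import asyncio

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/evaluation")
async def evaluation_updates(websocket: WebSocket):
    """Stream JSON progress events to the browser while an evaluation runs."""
    await websocket.accept()
    for percent in range(0, 101, 10):
        await websocket.send_json({"type": "progress", "percent": percent})
        await asyncio.sleep(0.5)  # stands in for real evaluation work
    await websocket.close()

# Run locally with: uvicorn app:app --port 7860
```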

📋 Available Models

Small Models (1-3B)

  • FLAN-T5 Large (0.8B) - Google
  • Qwen 2.5 3B (3B) - Alibaba
  • Gemma 2B (2B) - Google
  • DialoGPT Medium (345M) - Microsoft

Medium Models (7B)

  • Qwen 2.5 7B (7B) - Alibaba
  • Mistral 7B (7B) - Mistral AI
  • CodeLlama 7B Python (7B) - Meta

Large Models (14B+)

  • Qwen 2.5 14B (14B) - Alibaba
  • Qwen 2.5 32B (32B) - Alibaba
  • Qwen 2.5 72B (72B) - Alibaba

📊 Available Datasets

Reasoning

  • HellaSwag - Commonsense reasoning (60K samples)
  • CommonsenseQA - Reasoning questions (12.1K samples)
  • ARC - Science reasoning (7.8K samples)

Knowledge

  • MMLU - Multitask understanding (231K samples)
  • BoolQ - Reading comprehension (12.7K samples)

Math

  • GSM8K - Grade school math (17.6K samples)
  • AQUA-RAT - Algebraic reasoning (196K samples)

Code

  • HumanEval - Python code generation (164 samples)
  • MBPP - Basic Python problems (1.4K samples)

Language

  • IMDB Reviews - Sentiment analysis (100K samples)
  • CNN/DailyMail - Summarization (936K samples)

🎯 Evaluation Metrics

  • Accuracy - Percentage of correct predictions
  • F1 Score - Harmonic mean of precision and recall
  • BLEU Score - Text generation quality
  • ROUGE Score - Summarization quality
  • Pass@K - Code generation success rate
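
Accuracy and F1 are standard; Pass@K is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). A sketch of both, assuming NovaEval follows the standard definitions:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): n samples drawn, c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def accuracy(predictions, references):
    """Fraction of exact matches against the gold answers."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

print(pass_at_k(n=20, c=3, k=5))  # ~0.60: chance that 5 draws include a pass
```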

🚀 Quick Start

Option 1: Direct Upload to Hugging Face Spaces

  1. Create a new Space on Hugging Face
  2. Choose "Docker" as the SDK
  3. Upload these files:
    • app.py (renamed from advanced_novaeval_app.py)
    • requirements.txt
    • Dockerfile
    • README.md
  4. Commit and push - your Space will build automatically!
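
The Dockerfile referenced in step 3 isn't reproduced here, but a minimal Docker-SDK Space usually looks like the sketch below (Spaces route traffic to port 7860); treat it as an assumption, not the project's exact file:

```dockerfile
# Minimal sketch of a Docker-SDK Space; not necessarily the project's exact file.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 7860
CMD ["python", "app.py"]
```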

Option 2: Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python advanced_novaeval_app.py

# Open browser to http://localhost:7860
```

🔧 Configuration Options

Model Parameters

  • Sample Size: 10-1000 samples
  • Temperature: 0.0-2.0 (creativity control)
  • Max Tokens: 128-2048 (response length)
  • Top-p: 0.9 (nucleus sampling)
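
These parameters map directly onto the Hugging Face Inference API's text-generation options. For example, with `huggingface_hub` (the model choice and prompt are illustrative):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # free tier; pass token=... to raise rate limits
output = client.text_generation(
    "Q: What is 17 * 23?\nA:",
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative model choice
    temperature=0.7,                   # creativity control (0.0-2.0)
    max_new_tokens=256,                # response length (128-2048 in the UI)
    top_p=0.9,                         # nucleus sampling
)
print(output)
```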

Evaluation Settings

  • Multiple Model Selection: Compare up to 10 models
  • Flexible Metrics: Choose relevant metrics for your task
  • Real-time Monitoring: Watch evaluations progress live
  • Export Results: Download results in JSON format
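
The export schema isn't documented here; a plausible shape, written out with the standard library and placeholder scores, might be:

```python
import json

# Hypothetical export shape with placeholder scores, not real measurements.
results = {
    "dataset": "gsm8k",
    "sample_size": 100,
    "models": [
        {"id": "Qwen/Qwen2.5-7B-Instruct", "accuracy": 0.81},
        {"id": "mistralai/Mistral-7B-v0.1", "accuracy": 0.74},
    ],
}

with open("novaeval_results.json", "w") as f:
    json.dump(results, f, indent=2)
```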

📱 User Experience

Workflow

  1. Select Models - Choose from 15+ Hugging Face models
  2. Pick Dataset - Select from 11 evaluation datasets
  3. Configure Metrics - Choose relevant evaluation metrics
  4. Set Parameters - Adjust sample size, temperature, etc.
  5. Start Evaluation - Watch real-time progress and logs
  6. View Results - Analyze performance comparisons

Features

  • Model Search - Find models by name or provider
  • Category Filtering - Filter by model size or dataset type
  • Real-time Logs - See actual evaluation steps
  • Progress Tracking - Visual progress bars and percentages
  • Interactive Results - Compare models side-by-side

🌟 Why NovaEval?

For Researchers

  • Comprehensive Benchmarking across multiple models and datasets
  • Standardized Evaluation with consistent metrics and procedures
  • Real-time Monitoring to track evaluation progress
  • Export Capabilities for further analysis

For Developers

  • Easy Integration with Hugging Face ecosystem
  • No API Keys Required - uses free HF Inference API
  • Modern Interface with responsive design
  • Detailed Logging for debugging and analysis

For Teams

  • Collaborative Evaluation with shareable results
  • Professional Interface suitable for presentations
  • Comprehensive Documentation for easy onboarding
  • Open Source with full customization capabilities

📄 License

This project is open source and available under the MIT License.

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for more information.


Built with ❤️ by Noveum.ai - Advancing AI Evaluation