---
title: NovaEval by Noveum.ai
emoji:
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# NovaEval by Noveum.ai
Advanced AI Model Evaluation Platform powered by Hugging Face Models
## 🚀 Features
### 🤖 **Comprehensive Model Selection**
- **15+ Top Hugging Face Models** across different size categories
- **Real-time Model Search** with provider filtering
- **Detailed Model Information** including capabilities, size, and provider
- **Size-based Filtering** (Small ≤3B, Medium ~7B, Large 14B+)
### 📊 **Rich Dataset Collection**
- **11 Evaluation Datasets** covering reasoning, knowledge, math, code, and language
- **Category-based Filtering** for easy dataset discovery
- **Detailed Dataset Information** including sample counts and difficulty levels
- **Popular Benchmarks** like MMLU, HellaSwag, GSM8K, HumanEval
### ⚡ **Advanced Evaluation Engine**
- **Real-time Progress Tracking** with WebSocket updates
- **Live Evaluation Logs** showing detailed request/response data
- **Multiple Metrics Support** (Accuracy, F1-Score, BLEU, ROUGE, Pass@K)
- **Configurable Parameters** (sample size, temperature, max tokens)
### 🎨 **Modern User Interface**
- **Responsive Design** optimized for desktop and mobile
- **Interactive Model Cards** with hover effects and selection states
- **Real-time Configuration** with sliders and checkboxes
- **Professional Gradient Design** with smooth animations
## 🔧 **Technical Stack**
- **Backend**: FastAPI + Python 3.11
- **Frontend**: HTML5 + Tailwind CSS + Vanilla JavaScript
- **Real-time**: WebSocket for live updates
- **Models**: Hugging Face Inference API (free tier)
- **Deployment**: Docker + Hugging Face Spaces
## 📋 **Available Models**
### Small Models (≤3B)
- **DialoGPT Medium** (345M) - Microsoft
- **FLAN-T5 Large** (0.8B) - Google
- **Gemma 2B** (2B) - Google
- **Qwen 2.5 3B** (3B) - Alibaba
### Medium Models (7B)
- **Qwen 2.5 7B** (7B) - Alibaba
- **Mistral 7B** (7B) - Mistral AI
- **CodeLlama 7B Python** (7B) - Meta
### Large Models (14B+)
- **Qwen 2.5 14B** (14B) - Alibaba
- **Qwen 2.5 32B** (32B) - Alibaba
- **Qwen 2.5 72B** (72B) - Alibaba
## 📊 **Available Datasets**
### Reasoning
- **HellaSwag** - Commonsense reasoning (60K samples)
- **CommonsenseQA** - Reasoning questions (12.1K samples)
- **ARC** - Science reasoning (7.8K samples)
### Knowledge
- **MMLU** - Multitask understanding (231K samples)
- **BoolQ** - Reading comprehension (12.7K samples)
### Math
- **GSM8K** - Grade school math (17.6K samples)
- **AQUA-RAT** - Algebraic reasoning (196K samples)
### Code
- **HumanEval** - Python code generation (164 samples)
- **MBPP** - Basic Python problems (1.4K samples)
### Language
- **IMDB Reviews** - Sentiment analysis (100K samples)
- **CNN/DailyMail** - Summarization (936K samples)
## 🎯 **Evaluation Metrics**
- **Accuracy** - Percentage of correct predictions
- **F1 Score** - Harmonic mean of precision and recall
- **BLEU Score** - Text generation quality
- **ROUGE Score** - Summarization quality
- **Pass@K** - Code generation success rate
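Three of these metrics can be computed in plain Python. The sketch below is illustrative (function names are not NovaEval's API); `pass_at_k` uses the standard unbiased estimator, 1 − C(n−c, k)/C(n, k), for n sampled completions of which c pass.

```python
from math import comb

def accuracy(preds: list, labels: list) -> float:
    """Fraction of predictions that exactly match the labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1(preds: list, labels: list, positive=1) -> float:
    """Binary F1: harmonic mean of precision and recall for one positive class."""
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn from n samples with c correct, passes the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

BLEU and ROUGE involve n-gram matching against references and are best taken from an established library rather than reimplemented.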
## 🚀 **Quick Start**
### Option 1: Direct Upload to Hugging Face Spaces
1. Create a new Space on Hugging Face
2. Choose "Docker" as the SDK
3. Upload these files:
- `app.py` (renamed from `advanced_novaeval_app.py`)
- `requirements.txt`
- `Dockerfile`
- `README.md`
4. Commit and push - your Space will build automatically!
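If you need a starting point for the `Dockerfile` in step 3, a minimal sketch might look like the following, assuming the entrypoint is `app.py` (as renamed above) and the app listens on port 7860, the port Hugging Face Spaces expects:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Hugging Face Spaces routes traffic to port 7860
EXPOSE 7860
CMD ["python", "app.py"]
```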
### Option 2: Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run the application
python advanced_novaeval_app.py
# Open browser to http://localhost:7860
```
## 🔧 **Configuration Options**
### Model Parameters
- **Sample Size**: 10-1000 samples
- **Temperature**: 0.0-2.0 (controls sampling randomness)
- **Max Tokens**: 128-2048 (response length)
- **Top-p**: 0.9 (nucleus sampling)
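The parameters above map onto the Hugging Face Inference API's text-generation options (`temperature`, `top_p`, `max_new_tokens`). A hedged sketch of building and validating such a payload, with the ranges listed above (the function itself is illustrative, not NovaEval's API):

```python
def build_parameters(temperature: float = 0.7, max_tokens: int = 512, top_p: float = 0.9) -> dict:
    """Build a generation-parameter payload, enforcing the README's ranges."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 128 <= max_tokens <= 2048:
        raise ValueError("max_tokens must be in [128, 2048]")
    return {
        "temperature": temperature,
        "max_new_tokens": max_tokens,  # HF Inference API field name
        "top_p": top_p,
        "return_full_text": False,  # return only the completion, not the prompt
    }
```

The resulting dict would be sent as the `parameters` field of an Inference API request body alongside the prompt in `inputs`.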
### Evaluation Settings
- **Multiple Model Selection**: Compare up to 10 models
- **Flexible Metrics**: Choose relevant metrics for your task
- **Real-time Monitoring**: Watch evaluations progress live
- **Export Results**: Download results in JSON format
## 📱 **User Experience**
### Workflow
1. **Select Models** - Choose from 15+ Hugging Face models
2. **Pick Dataset** - Select from 11 evaluation datasets
3. **Configure Metrics** - Choose relevant evaluation metrics
4. **Set Parameters** - Adjust sample size, temperature, etc.
5. **Start Evaluation** - Watch real-time progress and logs
6. **View Results** - Analyze performance comparisons
### Features
- **Model Search** - Find models by name or provider
- **Category Filtering** - Filter by model size or dataset type
- **Real-time Logs** - See actual evaluation steps
- **Progress Tracking** - Visual progress bars and percentages
- **Interactive Results** - Compare models side-by-side
## 🌟 **Why NovaEval?**
### For Researchers
- **Comprehensive Benchmarking** across multiple models and datasets
- **Standardized Evaluation** with consistent metrics and procedures
- **Real-time Monitoring** to track evaluation progress
- **Export Capabilities** for further analysis
### For Developers
- **Easy Integration** with Hugging Face ecosystem
- **No API Keys Required** - uses free HF Inference API
- **Modern Interface** with responsive design
- **Detailed Logging** for debugging and analysis
### For Teams
- **Collaborative Evaluation** with shareable results
- **Professional Interface** suitable for presentations
- **Comprehensive Documentation** for easy onboarding
- **Open Source** with full customization capabilities
## 🔗 **Links**
- **Noveum.ai**: [https://noveum.ai](https://noveum.ai)
- **NovaEval Framework**: [https://github.com/Noveum/NovaEval](https://github.com/Noveum/NovaEval)
- **Hugging Face Models**: [https://huggingface.co/models](https://huggingface.co/models)
- **Documentation**: Available in the application interface
## 📄 **License**
This project is open source and available under the MIT License.
## 🤝 **Contributing**
We welcome contributions! Please see our contributing guidelines for more information.
---
**Built with ❤️ by [Noveum.ai](https://noveum.ai) - Advancing AI Evaluation**