10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly affect how well fine-tuning works and how the resulting model performs in real-world applications. With so many datasets available for different tasks, choosing the most comprehensive one for your purposes can be challenging.

So today, we invite you to explore the top 10 free datasets for natural language processing and math:

1. fka/awesome-chatgpt-prompts offers a wide variety of prompts that can be used with ChatGPT. Over 700 models were trained on this dataset.

2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It’s suitable for LLM training, benchmarking, and model validation.

3. HuggingFaceFW/fineweb-2 is another version of FineWeb that extends high-quality pretraining data to over 1000 languages.

4. O1-OPEN/OpenO1-SFT with Chinese and English data can be used for Chain-of-Thought activation.

5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.

6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarks, and training instruction-following models.

7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
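Instruction datasets such as yahma/alpaca-cleaned are usually turned into prompt/completion pairs before supervised fine-tuning. The sketch below assumes records with `instruction`, `input`, and `output` fields (the Alpaca schema) and uses the prompt wording commonly associated with the original Alpaca release. The function name `format_alpaca_example` is hypothetical, and the sample record is made up for illustration.

```python
# Prompt templates in the style of the original Alpaca release (assumed here;
# any consistent template can be substituted).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_example(record):
    """Return (prompt, completion) strings for one Alpaca-style record."""
    extra_input = (record.get("input") or "").strip()
    if extra_input:
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=extra_input
        )
    else:
        # Records without extra context use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


# Made-up record for illustration only.
sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt, completion = format_alpaca_example(sample)
```

With the `datasets` library installed, the same function could be applied to each row returned by `load_dataset("yahma/alpaca-cleaned")`.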

Math datasets:

1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.

2. amphora/QwQ-LongCoT-130K for training O1-like LLMs.

3. openai/gsm8k for training multi-step reasoning.
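For multi-step reasoning data such as openai/gsm8k, it often helps to separate the worked solution from the final answer. A minimal sketch, assuming the GSM8K convention that each `answer` string ends with a line of the form `#### <number>`; the helper name `split_gsm8k_answer` is hypothetical.

```python
import re


def split_gsm8k_answer(answer_text):
    """Split a GSM8K-style answer into (reasoning, final_answer).

    Returns final_answer as a string without thousands separators,
    or None if no '#### <number>' marker is found.
    """
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer_text)
    reasoning = answer_text.split("####")[0].strip()
    if match is None:
        return reasoning, None
    return reasoning, match.group(1).replace(",", "")


# Example in the GSM8K answer style.
example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
reasoning, final_answer = split_gsm8k_answer(example)
```

The reasoning part can serve as the chain-of-thought training target, while the extracted number is convenient for exact-match evaluation.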

reacted to singhsidhukuldeep's post with ❤️ about 1 year ago


Groundbreaking Research Alert: Revolutionizing Document Ranking with Long-Context LLMs

Researchers from Renmin University of China and Baidu Inc. have introduced a novel approach to document ranking that challenges conventional sliding window methods. Their work demonstrates how long-context Large Language Models can process up to 100 documents simultaneously, achieving superior performance while reducing API costs by 50%.

Key Technical Innovations:
- Full ranking strategy enables processing all passages in a single inference
- Multi-pass sliding window approach for comprehensive listwise label construction
- Importance-aware learning objective that prioritizes top-ranked passage IDs
- Support for context lengths up to 128k tokens using models like LLaMA 3.1-8B-Instruct
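To see why full ranking avoids redundant work, the sketch below enumerates the windows a back-to-front sliding window reranker would visit and counts how many passages it sends to the model in total. The window size of 20 and step of 10 are illustrative assumptions, not values given in the post, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_passages, window_size, step):
    """List (start, end) index pairs visited when sliding from the
    bottom of the candidate list toward the top."""
    windows = []
    end = num_passages
    while True:
        start = max(0, end - window_size)
        windows.append((start, end))
        if start == 0:
            break
        end -= step  # overlapping windows let strong passages move up
    return windows


# Assumed settings for illustration: 100 candidates, window 20, step 10.
windows = sliding_windows(num_passages=100, window_size=20, step=10)
sliding_calls = len(windows)
sliding_passages = sum(end - start for start, end in windows)

# Full ranking: one call that sees every passage exactly once.
full_calls = 1
full_passages = 100

print(f"sliding window: {sliding_calls} calls, {sliding_passages} passages read")
print(f"full ranking:   {full_calls} call, {full_passages} passages read")
```

Under these assumed settings, overlapping windows read most passages twice, which is the redundancy a single full-ranking pass removes.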

Performance Highlights:
- 2.2 point improvement in NDCG@10 metrics
- 29.3% reduction in latency compared to traditional methods
- Significant API cost savings through elimination of redundant passage processing
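For readers unfamiliar with the metric, NDCG@10 can be computed as follows. This is a minimal sketch using the common exponential-gain formulation (gain `2^rel - 1`, logarithmic position discount); the post does not say which variant the authors used, and the relevance labels below are made up.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up labels, listed in the order a ranker returned the passages.
perfect_order = [3, 2, 1, 0]
swapped_order = [0, 1, 3, 2]
score_perfect = ndcg_at_k(perfect_order)
score_swapped = ndcg_at_k(swapped_order)
```

Scores lie between 0 and 1. Papers often report them multiplied by 100, so if the post follows that convention, a 2.2 point gain would correspond to 0.022 on this scale.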

Under the hood, the system leverages advanced long-context LLMs to perform global interactions among passages, enabling more nuanced relevance assessment. The architecture incorporates a novel importance-aware loss function that assigns differential weights based on passage ranking positions.
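The idea of weighting the loss by ranking position can be illustrated with a weighted negative log-likelihood. The logarithmic weighting below is an assumption chosen for illustration, not the formula from the paper, and both function names are hypothetical.

```python
import math


def position_weights(num_passages):
    """Assumed weighting: rank 1 gets weight 1.0, later ranks decay
    logarithmically. The paper's actual weights may differ."""
    return [1.0 / math.log2(rank + 2) for rank in range(num_passages)]


def weighted_nll(id_log_probs, weights):
    """Weighted average negative log-likelihood of the target passage IDs,
    where id_log_probs[i] is the model's log-probability of the correct
    passage ID at output position i."""
    total = sum(w * -lp for w, lp in zip(weights, id_log_probs))
    return total / sum(weights)


weights = position_weights(3)

# Same mistake (probability 0.1 on the correct ID) at different ranks.
error_at_top = [math.log(0.1), math.log(0.9), math.log(0.9)]
error_at_bottom = [math.log(0.9), math.log(0.9), math.log(0.1)]

loss_top = weighted_nll(error_at_top, weights)
loss_bottom = weighted_nll(error_at_bottom, weights)
```

With these weights, an error on the first-ranked passage ID costs more than the same error on the last one, which matches the stated goal of prioritizing top-ranked passage IDs.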

The research team's implementation demonstrated remarkable versatility across multiple datasets, including TREC DL and BEIR benchmarks. Their fine-tuned model, RankMistral, showcases the practical viability of full ranking approaches in production environments.

This advancement marks a significant step forward in information retrieval systems, offering both improved accuracy and computational efficiency. The implications for search engines and content recommendation systems are substantial.
