Sample coding datasets for benchmarking and domain-specific AI models.
AI & ML interests
Accelerate the frontier of AI development with enterprise-grade, deeply curated datasets engineered to enhance pre-training, alignment, and real-world performance.
Sample datasets of dual-channel call center audio with separate agent and customer channels for ASR, diarization, and conversational AI training.
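Because each speaker is recorded on a separate channel, a recording can be split into two mono tracks before transcription or diarization. The following is a minimal sketch using only the Python standard library, assuming the audio is stored as uncompressed 2-channel PCM WAV. The function name split_stereo_wav is hypothetical, and which channel carries the agent and which the customer is not stated here, so the mapping must be checked against the dataset's own documentation.

```python
import io
import wave
from typing import Tuple


def split_stereo_wav(stereo_bytes: bytes) -> Tuple[bytes, bytes]:
    """Split a 2-channel PCM WAV into two mono WAVs (channel 0, channel 1).

    Assumption: which channel is the agent and which is the customer
    is dataset-specific and is not decided by this function.
    """
    with wave.open(io.BytesIO(stereo_bytes), "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel WAV file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()    # samples per second
        frames = src.readframes(src.getnframes())

    # PCM frames are interleaved: [ch0 sample][ch1 sample][ch0 sample]...
    frame_size = 2 * width
    first = bytearray()
    second = bytearray()
    for i in range(0, len(frames), frame_size):
        first += frames[i:i + width]
        second += frames[i + width:i + frame_size]

    def to_mono_wav(pcm: bytes) -> bytes:
        # Wrap raw mono PCM in a WAV container with the original format.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(pcm)
        return buf.getvalue()

    return to_mono_wav(bytes(first)), to_mono_wav(bytes(second))
```

Each returned byte string is a complete mono WAV file that can be written to disk or passed to an ASR pipeline, so the two speakers can be transcribed independently.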
Sample datasets from a 6.5M+ enterprise-grade Q&A corpus across STEM and Non-STEM domains, built for LLM training, instruction tuning, and evaluation.
- InfoBayAI/Hindi_STEM_Question_Answering_MCQA_Dataset • 200 • 27
- InfoBayAI/English_STEM_Question_Answering_MCQA_Dataset • 200 • 39
- InfoBayAI/English-Non-STEM-Question-Answering-MCQA-Dataset • 5 • 24
- InfoBayAI/Arabic-STEM-Question-Answering-MCQA-Dataset • 49 • 27
Sample dataset from an enterprise-grade medical corpus built for clinical AI, diagnosis support, and healthcare LLM training.
- InfoBayAI/mri_clinical_reports_without_findings_medical_nlp • 588 • 24
- InfoBayAI/ct_scan_clinical_reports_without_findings_medical_nlp • 2.6k • 22
- InfoBayAI/ct_scan_clinical_reports_with_findings_medical_nlp • 6.3k • 23
- InfoBayAI/xray_clinical_reports_without_findings_medical_nlp • 19
Sample from a podcast audio dataset, designed for automatic speech recognition (ASR) and conversational AI training using diverse, real-world spoken content.
Sample of a 2.2B+ word textbook corpus across 32K+ books, 5K+ subjects, and 14 languages for LLM training and multilingual knowledge modeling.
- InfoBayAI/Hindi-STEM-Educational-Text-Corpus • 1.14k • 22
- InfoBayAI/English-STEM-Educational-Text-Corpus • 1.27k • 27
- InfoBayAI/English-Non-STEM-Educational-Text-Corpus • 1.62k • 20
- InfoBayAI/Arabic-STEM-Educational-Text-Corpus • 1.35k • 28
Sample dataset from a multilingual image corpus covering medical, STEM, Non-STEM, automobile, and complex domains for computer vision and multimodal AI.