Sample from a podcast audio dataset designed for automatic speech recognition (ASR) and conversational AI training on diverse, real-world spoken content.
AI & ML interests
Accelerate the frontier of AI development with enterprise-grade, deeply curated datasets engineered to enhance pre-training, alignment, and real-world performance.
Sample dataset from an enterprise-grade medical corpus built for clinical AI, diagnosis support, and healthcare LLM training.
Sample of a 2.2B+ word textbook corpus across 32K+ books, 5K+ subjects, and 14 languages for LLM training and multilingual knowledge modeling.
- InfoBayAI/Hindi-Non-STEM-Educational-Text-Corpus • 1.14k • 18
- InfoBayAI/English-STEM-Educational-Text-Corpus • 1.27k • 18
- InfoBayAI/English-Non-STEM-Educational-Text-Corpus • 1.62k • 9
- InfoBayAI/Arabic-STEM-Educational-Text-Corpus • 1.35k • 18
Sample dataset from a multilingual image corpus covering medical, STEM, Non-STEM, automobile, and complex domains for computer vision and multimodal AI.
Sample datasets of dual-channel call center audio with separate agent and customer channels for ASR, diarization, and conversational AI training.
- InfoBayAI/Dual-Channel_Audio_Call-Center_English_In • 11 • 8
- InfoBayAI/Dual-Channel_Audio_Call-Center_English_US • 9 • 16 • 1
- InfoBayAI/Dual-Channel_Audio_Call-Center_English_UK • 3 • 10
- InfoBayAI/Dual-Channel_Audio_Call-Center_Hindi • 10 • 7
Sample datasets from a 6.5M+ enterprise-grade Q&A corpus across STEM and Non-STEM domains, built for LLM training, instruction tuning, and evaluation.
- InfoBayAI/Hindi_STEM_Question_Answering_MCQA_Dataset • 200 • 31
- InfoBayAI/English_STEM_Question_Answering_MCQA_Dataset • 200 • 20
- InfoBayAI/English-Non-STEM-Question-Answering-MCQA-Dataset • 5 • 20
- InfoBayAI/Arabic-STEM-Question-Answering-MCQA-Dataset • 49 • 11