vumichien

https://medium.com/@chienvu

AI & ML interests

NLP, Computer Vision, Recommendation, RL

Recent Activity

liked a Space 29 days ago

HuggingFaceTB/smol-training-playbook

authored a paper about 2 months ago

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

authored a paper about 2 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

upvoted 2 papers about 2 months ago

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Paper • 2509.25531 • Published Sep 29 • 7

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9 • 35

upvoted a paper 10 months ago

Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published Dec 19, 2024 • 10

upvoted a paper 12 months ago

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published Dec 5, 2024 • 63

upvoted 2 papers over 1 year ago

Consent in Crisis: The Rapid Decline of the AI Data Commons

Paper • 2407.14933 • Published Jul 20, 2024 • 14

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Paper • 2406.15877 • Published Jun 22, 2024 • 48

upvoted an article over 1 year ago

BigCodeBench: The Next Generation of HumanEval

Article • Jun 18, 2024 • 52

upvoted a paper over 1 year ago

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

Paper • 2404.00399 • Published Mar 30, 2024 • 42

upvoted a paper almost 2 years ago

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Paper • 2402.06619 • Published Feb 9, 2024 • 56

upvoted 3 papers over 2 years ago

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper • 2211.05100 • Published Nov 9, 2022 • 34

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 7

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Paper • 2206.15076 • Published Jun 30, 2022 • 5