LLM Tech Reports
Qwen3 Technical Report • Paper • 2505.09388 • Published May 14, 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs • Paper • 2501.12599 • Published Jan 22, 2025
Training language models to follow instructions with human feedback • Paper • 2203.02155 • Published Mar 4, 2022
RLHF Papers
Proximal Policy Optimization Algorithms • Paper • 1707.06347 • Published Jul 20, 2017
Direct Preference Optimization: Your Language Model is Secretly a Reward Model • Paper • 2305.18290 • Published May 29, 2023
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models • Paper • 2402.03300 • Published Feb 5, 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning • Paper • 2501.12948 • Published Jan 22, 2025