RoGuard 1.0: Advancing Safety for LLMs with Robust Guardrails

RoGuard 1.0, a SOTA instruction fine-tuned LLM, is designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and response levels, deciding whether or not each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs. At the heart of our system is an LLM that’s been fine-tuned from the Llama-3.1-8B-Instruct model, the license for which is at: https://www.llama.com/llama3_1/license/. We trained this LLM with a particular focus on high-quality instruction tuning to optimize for safety judgment performance.
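The dual-level assessment described above can be sketched as a small wrapper around a text generator. This is only an illustration of the control flow, not the RoGuard implementation: `moderated_generate`, `ModerationResult`, the `classify` callable, and the keyword-based `toy_classify` stub are all hypothetical names introduced here, and the choice to withhold output when a violation is found is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical classifier signature: (prompt, response or None) -> True if violating.
# In a real deployment this would wrap a call to a guardrail model.
Classifier = Callable[[str, Optional[str]], bool]


@dataclass
class ModerationResult:
    prompt_violates: bool
    response_violates: Optional[bool]  # None when generation was skipped
    output: Optional[str]              # None when the text was withheld


def moderated_generate(
    prompt: str,
    generate: Callable[[str], str],
    classify: Classifier,
) -> ModerationResult:
    """Check the prompt first, then check the generated response."""
    # Prompt-level check: skip generation if the user input violates policy.
    if classify(prompt, None):
        return ModerationResult(True, None, None)

    response = generate(prompt)

    # Response-level check: withhold the output if it violates policy.
    if classify(prompt, response):
        return ModerationResult(False, True, None)

    return ModerationResult(False, False, response)


def toy_classify(prompt: str, response: Optional[str]) -> bool:
    """Stand-in classifier for the demo: flags text containing 'forbidden'."""
    text = response if response is not None else prompt
    return "forbidden" in text.lower()


if __name__ == "__main__":
    result = moderated_generate("hello", lambda p: "hi there", toy_classify)
    print(result)
```

Passing the classifier and generator in as callables keeps the sketch independent of any particular model-serving stack.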

📊 Model Benchmark Results

[Image: benchmark results table]

We benchmark the RoGuard 1.0 model on a comprehensive set of open-source datasets for both prompt and response classification, as well as on RoGuard-Eval. This allows us to evaluate the model on both in-domain and out-of-domain datasets. We report results as F1 scores for binary violating/non-violating classification. The table above compares our performance with that of several well-known models. RoGuard 1.0 outperforms the other models while generalizing to out-of-domain datasets.

  • Prompt Metrics: These evaluate how well the model classifies or responds to potentially harmful user inputs.
  • Response Metrics: These measure how well the model handles or generates responses, ensuring its outputs are safe and aligned.
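For readers unfamiliar with the metric, the F1 score for binary violating/non-violating classification can be computed as in the minimal sketch below. The function name, the label strings, and the sample labels are illustrative assumptions; they are not taken from the RoGuard evaluation framework and the data is not a benchmark result.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str],
             positive: str = "violating") -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = ["violating", "violating", "non-violating", "non-violating", "violating"]
y_pred = ["violating", "non-violating", "non-violating", "violating", "violating"]

# Here tp=2, fp=1, fn=1, so precision = recall = 2/3.
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

The same function applies to either level: for prompt metrics the labels refer to user inputs, and for response metrics they refer to model outputs.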

🔗 GitHub Repository

You can find the full source code and evaluation framework on GitHub:

👉 Roblox/RoGuard on GitHub

Citation

If you are using this model, please cite it as:

@online{roblox2025roguard,
  author       = {Mahesh Nandwana and Adam McFarlin and Nishchaie Khanna},
  title        = {State‑of‑the‑Art LLM Helps Safeguard Unlimited Text Generation on Roblox: RoGuard 1.0 — Advancing Safety With Robust Guardrails},
  year         = {2025},
  month        = {Jul 22},
  howpublished = {\url{https://corp.roblox.com/newsroom/2025/07/roguard-advancing-safety-for-llms-with-robust-guardrails}},
}