Self-Fulfilling Model Organisms
Viewer • Updated • 1.07k • 108
Note: Labeled test set for classifying whether a document is unrelated to AI, neutral AI discourse, AI misalignment discourse, or positive AI discourse
Kyle1668/alignment-classifier-documents-unlabeled
Viewer • Updated • 57.9k • 93
Note: LessWrong and documents related to AI alignment
Kyle1668/anthropic-propensity-evals-human-written-refined
Viewer • Updated • 4.28k • 269 • 1
Note: Filtered and reformatted version of Anthropic's propensity evaluations
Kyle1668/sfm-finetuning-dataset-v1.5
Viewer • Updated • 306k • 19
Note: Model organisms dataset made up of both LessWrong and general data
Kyle1668/sfm-finetuning-dataset-v1.5-replay-only
Viewer • Updated • 248k • 30
Note: Model organisms dataset made up of general data only
Kyle1668/tulu3-sft-english-only-no-refusal-or-ai
Viewer • Updated • 704k • 25
Note: Generic instruction-following data from Tulu-3, with most refusals and discussions of AI removed via string matching
Kyle1668/dclm-dedup-25B-ai-scifi-docs
Viewer • Updated • 27.9k • 50 • 1
Note: A sample of documents from DCLM that reference AI science fiction
Kyle1668/pt_alignment_continue_baseline_v1_7
Text Generation • 7B • Updated • 219
Note: Continual pretraining on LessWrong: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_1
Text Generation • 7B • Updated • 83
Note: Continual pretraining on LessWrong: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_42
Text Generation • 7B • Updated • 97
Note: Continual pretraining on LessWrong: Seed=42
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only
Text Generation • 7B • Updated • 90
Note: Continual pretraining on replay data unrelated to AI: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_1
Text Generation • 7B • Updated • 53
Note: Continual pretraining on replay data unrelated to AI: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_42
Text Generation • 7B • Updated • 67
Note: Continual pretraining on replay data unrelated to AI: Seed=42
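
The note on Kyle1668/tulu3-sft-english-only-no-refusal-or-ai says string matching was used to remove most refusals and discussions of AI. The actual patterns are not listed on this page, so the following is only a minimal sketch of that kind of filter. The phrase lists, the `messages`/`content` record layout, and the function names `should_drop` and `filter_examples` are all assumptions made for illustration.

```python
import re

# Hypothetical phrase lists: the patterns actually used for the dataset
# are not given in the source.
REFUSAL_PHRASES = ["i'm sorry, but", "i cannot", "i can't assist"]
AI_PHRASES = ["artificial intelligence", "language model", "chatbot"]

# Case-sensitive whole-word match, so "AI" is caught but "rain" is not.
AI_WORD = re.compile(r"\bAI\b")


def should_drop(text: str) -> bool:
    """Return True if the text looks like a refusal or mentions AI."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in REFUSAL_PHRASES + AI_PHRASES):
        return True
    return AI_WORD.search(text) is not None


def filter_examples(examples):
    """Keep only conversations in which no message triggers should_drop.

    Assumes each example is a dict with a "messages" list of
    {"role": ..., "content": ...} dicts (an assumed layout).
    """
    return [
        ex for ex in examples
        if not any(should_drop(msg["content"]) for msg in ex["messages"])
    ]
```

Plain substring matching like this is cheap but imprecise, which is consistent with the note's wording that "most", not all, such examples were removed.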