Self-Fulfilling Model Organisms - a Kyle1668 Collection

Self-Fulfilling Model Organisms

updated 11 days ago

Kyle1668/labeled_alignment_discourse_v1

Viewer • Updated 2 days ago • 1.07k • 108

Note Labeled test set indicating whether each example is unrelated to AI, neutral AI discourse, AI misalignment, or positive AI discourse
Kyle1668/alignment-classifier-documents-unlabeled

Viewer • Updated Sep 29 • 57.9k • 93

Note LessWrong and documents related to AI alignment
Kyle1668/anthropic-propensity-evals-human-written-refined

Viewer • Updated Oct 4 • 4.28k • 269 • 1

Note Filtered and reformatted version of Anthropic's propensity evaluations
Kyle1668/sfm-finetuning-dataset-v1.5

Viewer • Updated Sep 30 • 306k • 19

Note Model organisms dataset made up of both LessWrong and general data
Kyle1668/sfm-finetuning-dataset-v1.5-replay-only

Viewer • Updated Oct 1 • 248k • 30

Note Model organisms dataset made up of just general data
Kyle1668/tulu3-sft-english-only-no-refusal-or-ai

Viewer • Updated Oct 13 • 704k • 25

Note Tulu-3 generic instruction-following datasets. String matching was used to remove most refusals and discussions of AI
Kyle1668/dclm-dedup-25B-ai-scifi-docs

Viewer • Updated Oct 1 • 27.9k • 50 • 1

Note A sample of documents from DCLM that reference AI science fiction
Kyle1668/pt_alignment_continue_baseline_v1_7

Text Generation • 7B • Updated Oct 5 • 219

Note Continual pretraining on LessWrong: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_1

Text Generation • 7B • Updated Oct 6 • 83

Note Continual pretraining on LessWrong: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_42

Text Generation • 7B • Updated Oct 6 • 97

Note Continual pretraining on LessWrong: Seed=42
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only

Text Generation • 7B • Updated Oct 5 • 90

Note Continual pretraining on replay data unrelated to AI: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_1

Text Generation • 7B • Updated Oct 6 • 53

Note Continual pretraining on replay data unrelated to AI: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_42

Text Generation • 7B • Updated Oct 6 • 67

Note Continual pretraining on replay data unrelated to AI: Seed=42
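
The note on Kyle1668/tulu3-sft-english-only-no-refusal-or-ai says string matching was used to remove most refusals and discussions of AI. A minimal sketch of what such a filter might look like follows. The marker phrases, the function names `should_drop` and `filter_examples`, and the whole-word `AI` rule are all assumptions made for illustration; the collection does not state which strings were actually matched.

```python
import re

# Hypothetical marker phrases (assumptions, not the dataset's actual list).
REFUSAL_MARKERS = ["i'm sorry, but", "i cannot", "i can't assist", "as an ai"]
AI_MARKERS = ["artificial intelligence", "language model", "chatbot"]

# Case-sensitive whole-word match so that words like "said" or "rain"
# are not mistaken for a mention of AI.
_AI_WORD = re.compile(r"\bAI\b")


def should_drop(text: str) -> bool:
    """Return True if the text matches any refusal or AI marker."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS + AI_MARKERS):
        return True
    return bool(_AI_WORD.search(text))


def filter_examples(examples):
    """Keep only the texts that match none of the markers."""
    return [ex for ex in examples if not should_drop(ex)]


if __name__ == "__main__":
    samples = [
        "Here is a recipe for bread.",
        "I'm sorry, but I can't help with that.",
        "AI will change the world.",
    ]
    for kept in filter_examples(samples):
        print(kept)
```

A filter of this kind is cheap but imprecise, which fits the note's wording that it removes "most" rather than all such examples.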