|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- en |
|
--- |
|
|
|
This repository contains the trained parameters of the **S**entence-level, **S**upervised, **S**parse **A**uto**E**ncoder (S3AE) proposed in the paper ["Emergence of psychopathological computations in large language models"](https://arxiv.org/abs/2504.08016). |
|
Code for the S3AE architecture, along with usage examples, is available on [GitHub](https://github.com/syleeheal/Machine_Psychopathology). |
|
|
|
S3AE was trained on the residual stream at the 10th layer of the instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset with psychopathology symptom labels. The weights are stored in bfloat16, and the hidden dimension is 8 times the width of the LLM residual stream. |
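
As a rough illustration of the sizing described above, the sketch below computes the S3AE hidden width from the residual stream width. The function name `sae_hidden_dim` is hypothetical, and the residual width of 4608 is an assumption taken from the published Gemma 2 27B configuration; check it against `hidden_size` in the model's `config.json` before relying on it.

```python
# Expansion factor stated in this model card: hidden dim = 8 x residual width.
EXPANSION_FACTOR = 8

# Assumption: residual stream width of Gemma 2 27B (verify against `hidden_size`
# in the model's config.json).
ASSUMED_RESIDUAL_WIDTH = 4608


def sae_hidden_dim(residual_width: int, expansion: int = EXPANSION_FACTOR) -> int:
    """Return the autoencoder hidden width for a given residual stream width."""
    if residual_width <= 0 or expansion <= 0:
        raise ValueError("residual_width and expansion must be positive")
    return residual_width * expansion


hidden_dim = sae_hidden_dim(ASSUMED_RESIDUAL_WIDTH)
print(hidden_dim)
```

Under that assumption, only the first 17 of the hidden features carry the symptom labels listed in this card.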
|
|
|
The first 17 dimensions of the S3AE hidden features correspond, in order, to activations of the following thoughts: |
|
|
|
1: 'depressed mood', |
|
2: 'anhedonia (loss of interest)', |
|
3: 'pessimism', |
|
4: 'guilt', |
|
5: 'anxiety', |
|
6: 'catastrophic thinking', |
|
7: 'perfectionism', |
|
8: 'active avoidance', |
|
9: 'grandiosity (delusion of grandeur)', |
|
10: 'manic mood', |
|
11: 'impulsivity', |
|
12: 'risk-seeking', |
|
13: 'splitting (binary thinking)', |
|
14: 'unstable self-image', |
|
15: 'aggression', |
|
16: 'anger', |
|
17: 'irritability'. |
|
|
|
Dimensions 7, 13, and 14 were not used for the paper's analysis. |
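
The mapping above can be sketched in Python as follows. This is a minimal illustration rather than the repository's own API: the names `SYMPTOM_LABELS`, `UNUSED_DIMS`, and `label_activations` are hypothetical, and the feature vector is assumed to be a plain sequence of numbers whose first 17 entries are the labeled dimensions. Note that the list in this card is 1-indexed, while Python indexing is 0-based.

```python
# Labels in the order given in this model card (dimension 1 first).
SYMPTOM_LABELS = [
    "depressed mood",
    "anhedonia (loss of interest)",
    "pessimism",
    "guilt",
    "anxiety",
    "catastrophic thinking",
    "perfectionism",
    "active avoidance",
    "grandiosity (delusion of grandeur)",
    "manic mood",
    "impulsivity",
    "risk-seeking",
    "splitting (binary thinking)",
    "unstable self-image",
    "aggression",
    "anger",
    "irritability",
]

# 1-indexed dimensions that were not used for the paper's analysis.
UNUSED_DIMS = {7, 13, 14}


def label_activations(features, include_unused=False):
    """Map the first 17 entries of a feature vector to their symptom labels."""
    if len(features) < len(SYMPTOM_LABELS):
        raise ValueError("feature vector must have at least 17 entries")
    labeled = {}
    for dim, label in enumerate(SYMPTOM_LABELS, start=1):
        if dim in UNUSED_DIMS and not include_unused:
            continue
        # Dimension `dim` in the card is index `dim - 1` in Python.
        labeled[label] = float(features[dim - 1])
    return labeled


# Usage with a made-up feature vector: only dimension 5 ('anxiety') is active.
example_features = [0.0] * 17
example_features[4] = 1.5
example_result = label_activations(example_features)
print(example_result["anxiety"])
```

Passing `include_unused=True` keeps dimensions 7, 13, and 14 in the output for cases where all 17 labeled features are of interest.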