syleetolow/s3ae


---
license: cc-by-nc-4.0
language:
- en
---

This repository contains the trained parameters of the Sentence-level, Supervised, Sparse AutoEncoder (S3AE) proposed in the paper ["Emergence of psychopathological computations in large language models"](https://arxiv.org/abs/2504.08016).
Code for the S3AE architecture, along with usage examples, is available on [GitHub](https://github.com/syleeheal/Machine_Psychopathology).

S3AE was trained on the residual stream at the 10th layer of the instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset annotated with psychopathology symptom labels. The weights are stored in bfloat16, and the hidden dimension is 8 times the width of the LLM residual stream.
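As a quick sanity check on the sizes described above, the following sketch computes the implied S3AE hidden width. The residual-stream width of 4608 is an assumption taken from the published Gemma 2 27B configuration, not from this card, and `s3ae_hidden_dim` is a hypothetical helper name used only for illustration.

```python
# Assumption: 4608 is the residual-stream width (hidden_size) of Gemma 2 27B,
# as listed in that model's published configuration; it is not stated in this card.
RESIDUAL_DIM = 4608

# Stated in this card: the S3AE hidden dimension is 8 times the residual width.
EXPANSION_FACTOR = 8


def s3ae_hidden_dim(residual_dim=RESIDUAL_DIM, expansion=EXPANSION_FACTOR):
    """Return the S3AE hidden width implied by the card's description."""
    return residual_dim * expansion


hidden_dim = s3ae_hidden_dim()
print(hidden_dim)  # 36864
```

Under that assumption, only the first 17 of the 36864 hidden features carry the symptom labels listed in this card.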

The first 17 dimensions of the S3AE hidden features correspond, in order, to activations of the following thoughts:

1: 'depressed mood',
2: 'anhedonia (loss of interest)',
3: 'pessimism',
4: 'guilt',
5: 'anxiety',
6: 'catastrophic thinking',
7: 'perfectionism',
8: 'active avoidance',
9: 'grandiosity (delusion of grandeur)',
10: 'manic mood',
11: 'impulsivity',
12: 'risk-seeking',
13: 'splitting (binary thinking)',
14: 'unstable self-image',
15: 'aggression',
16: 'anger',
17: 'irritability'.

Dimensions 7, 13, and 14 were not used in the paper's analysis.
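The mapping above can be expressed as a small lookup table. This is a minimal sketch, not the authors' code: it assumes the card's 1-based dimension numbers correspond to 0-based indices 0 to 16 of the S3AE hidden feature vector, and the helper names (`label_for_dim`, `analysed_dims`, `named_activations`) are hypothetical.

```python
# Labels in the order given by this card (card dimension 1 is list index 0).
SYMPTOM_LABELS = [
    "depressed mood",
    "anhedonia (loss of interest)",
    "pessimism",
    "guilt",
    "anxiety",
    "catastrophic thinking",
    "perfectionism",
    "active avoidance",
    "grandiosity (delusion of grandeur)",
    "manic mood",
    "impulsivity",
    "risk-seeking",
    "splitting (binary thinking)",
    "unstable self-image",
    "aggression",
    "anger",
    "irritability",
]

# Dimensions (1-based) that the card says were not used in the paper's analysis.
UNUSED_DIMS = {7, 13, 14}


def label_for_dim(dim):
    """Return the label for a 1-based dimension number from the card."""
    if not 1 <= dim <= len(SYMPTOM_LABELS):
        raise ValueError("dimension must be between 1 and 17")
    return SYMPTOM_LABELS[dim - 1]


def analysed_dims():
    """Return the 1-based dimensions that were used in the paper's analysis."""
    return [d for d in range(1, len(SYMPTOM_LABELS) + 1) if d not in UNUSED_DIMS]


def named_activations(hidden_features):
    """Pair the first 17 entries of a hidden feature vector with their labels.

    Assumption: hidden_features is an indexable sequence of numbers whose
    index 0 holds the card's dimension 1.
    """
    return {label: float(hidden_features[i]) for i, label in enumerate(SYMPTOM_LABELS)}
```

For example, `label_for_dim(5)` returns `'anxiety'`, and `analysed_dims()` lists the 14 dimensions that remain after excluding 7, 13, and 14.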