---
pipeline_tag: text-generation
library_name: transformers
tags:
  - mlx
base_model: Qwen/Qwen3-4B-Thinking-2507
---


# mem-agent-mlx-4bit

This is the MLX version of the model, quantized to 4-bit precision.

Based on [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), this model was trained using GSPO (Zheng et al., 2025) on an agent scaffold built around an Obsidian-like memory system and the tools required to interact with it. The model was trained on the following subtasks:
- Retrieval: retrieving relevant information from the memory system when needed. In this subtask, we also trained the model to filter the retrieved information and/or obfuscate it completely.
- Updating: updating the memory system with new information.
- Clarification: asking for clarification when the user query is unclear or contradicts the information in the memory system.

The tools in the scaffold are:
```python
# File Operations
create_file(file_path: str, content: str = "") -> bool  # Auto-creates parent directories
update_file(file_path: str, old_content: str, new_content: str) -> Union[bool, str] # Returns True or error message
read_file(file_path: str) -> str
delete_file(file_path: str) -> bool
check_if_file_exists(file_path: str) -> bool

# Directory Operations
create_dir(dir_path: str) -> bool
list_files() -> str  # Shows tree structure of current working directory
check_if_dir_exists(dir_path: str) -> bool

# Utilities
get_size(file_or_dir_path: str) -> int  # Bytes; empty = total memory size
go_to_link(link_string: str) -> bool
```

The model uses `<think>`, `<python>`, and `<reply>` tags to structure its response, emitting `<reply>` only when it is done interacting with the memory. The `<python>` block is executed in a sandbox with the tools above, and the result of the code block is returned to the model in a `<result>` tag, forming the agentic loop.
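For illustration, a single hypothetical turn of this loop (the query and memory contents are invented for this sketch, using the `read_file` tool from the scaffold) might look like:

```
<think>The user asks where they live; user.md should contain this.</think>
<python>
print(read_file("user.md"))
</python>
<result>
# User Information
- living_location: Enschede, Netherlands
</result>
<reply>You live in Enschede, Netherlands.</reply>
```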

The model is also trained to handle optional filters given by the user between `<filter>` tags after the user query. These filters are used to filter the retrieved information and/or obfuscate it completely.
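A hypothetical filtered query (invented for illustration) could look like:

```
What is my mother's birth date?
<filter>Do not reveal exact dates; month and year only.</filter>
```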


## Benchmark

We evaluated this model and several other open & closed models on our benchmark, **md-memory-bench**, using OpenAI's o3 as the judge. All models except driaforall/mem-agent and Qwen/Qwen3-4B-Thinking-2507 were accessed through OpenRouter.

| Model | Retrieval | Update | Clarification | Filter | Overall |
|-------|-----------|--------|---------------|--------|---------|
| qwen/qwen3-235b-a22b-thinking-2507 | 0.9091 | 0.6363 | 0.4545 | 1.0000 | 0.7857 |
| driaforall/mem-agent | 0.8636 | 0.7272 | 0.3636 | 0.9167 | 0.7500 |
| z-ai/glm-4.5 | 0.7727 | 0.8181 | 0.3636 | 0.9167 | 0.7321 |
| deepseek/deepseek-chat-v3.1 | 0.6818 | 0.5454 | 0.5454 | 0.8333 | 0.6607 |
| google/gemini-2.5-pro | 0.7273 | 0.4545 | 0.2727 | 1.0000 | 0.6429 |
| google/gemini-2.5-flash | 0.7727 | 0.3636 | 0.2727 | 0.9167 | 0.6250 |
| openai/gpt-5 | 0.6818 | 0.5454 | 0.2727 | 0.9167 | 0.6250 |
| anthropic/claude-opus-4.1 | 0.6818 | 0.0000 | 0.8181 | 0.5833 | 0.5536 |
| Qwen/Qwen3-4B-Thinking-2507 | 0.4545 | 0.0000 | 0.2727 | 0.7500 | 0.3929 |
| moonshotai/kimi-k2 | 0.3181 | 0.2727 | 0.1818 | 0.6667 | 0.3571 |

Our model, with only 4B parameters, ranks second on the benchmark, beating all open & closed models except qwen/qwen3-235b-a22b-thinking-2507. It achieves an overall score of 0.75, a significant improvement over the base Qwen model's 0.3929.

## Usage

While the model can be used on its own, we recommend running it as an MCP server for a larger model, which then interacts with the memory system through it. For this, check [our repo](https://huggingface.co/driaforall/mem-agent-mcp), which contains instructions for both the MCP setup and standalone CLI usage.
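For quick standalone experimentation, a minimal sketch using the `mlx-lm` package might look as follows; the repo id and query are assumptions for this example, and the MCP setup above remains the recommended path:

```python
# Minimal standalone sketch using mlx-lm (`pip install mlx-lm`, Apple silicon).
# The repo id and user message below are illustrative assumptions.
from mlx_lm import load, generate

model, tokenizer = load("driaforall/mem-agent-mlx-4bit")  # assumed HF repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Where do I live?"}],
    add_generation_prompt=True,
    tokenize=False,
)

# The output should follow the <think>/<python>/<reply> structure; a full
# agent harness would execute the <python> block and feed back a <result>.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)
```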

### Memory

The model uses a markdown-based memory system with links, inspired by Obsidian. The general structure of the memory is:
```
memory/
β”œβ”€β”€ user.md
└── entities/
    β”œβ”€β”€ [entity_name_1].md
    β”œβ”€β”€ [entity_name_2].md
    └── ...
```

- `user.md` is the main file that contains information about the user and their relationships, with a link to the corresponding entity file in the format `[[entities/[entity_name].md]]` for each relationship. This link format must be followed strictly (a scaffolding sketch follows this list).
- `entities/` is the directory that contains the entity files.
- Each entity file follows the same structure as `user.md`.
- Modifying the memory manually does not require restarting the MCP server.
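A minimal Python sketch (not part of the model's toolset; names and contents are illustrative) that scaffolds a memory directory following this standard:

```python
# Scaffold a memory directory following the standard above.
# Not part of the model's toolset; names and contents are illustrative.
from pathlib import Path

memory = Path("memory")
(memory / "entities").mkdir(parents=True, exist_ok=True)

# user.md links to one entity file per relationship via [[...]] links.
(memory / "user.md").write_text(
    "# User Information\n"
    "- user_name: John Doe\n"
    "\n"
    "## User Relationships\n"
    "- mother: [[entities/jane_doe.md]]\n"
)

# Each entity file follows the same structure as user.md.
(memory / "entities" / "jane_doe.md").write_text(
    "# Jane Doe\n"
    "- relationship: Mother\n"
)
```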

### Example user.md

```markdown
# User Information
- user_name: John Doe
- birth_date: 1990-01-01
- birth_location: New York, USA
- living_location: Enschede, Netherlands
- zodiac_sign: Aquarius

## User Relationships
- company: [[entities/acme_corp.md]]
- mother: [[entities/jane_doe.md]]
```

### Example entity files (jane_doe.md and acme_corp.md)

```markdown
# Jane Doe
- relationship: Mother
- birth_date: 1965-01-01
- birth_location: New York, USA
```

```markdown 
# Acme Corporation
- industry: Software Development
- location: Enschede, Netherlands
```

The model is trained on this memory standard, so for best results it should be used with a memory system that follows it. Our MCP server repo includes memory export tools for different sources such as ChatGPT and Notion.

## References
- [GSPO](https://arxiv.org/pdf/2507.18071), Zheng et al., 2025