
metadata

license: eupl-1.1
language: code

Model Card - ARM64BERT

Who to contact: fbda [at] nfi [dot] nl
Version / Date: v1, 15/05/2025
TODO: add link to github repo once known

General

What is the purpose of the model

The model is a BERT model for ARM64 assembly code that can be used for semantic search: given an ARM64 function, it finds similar ARM64 functions. This model has NOT been finetuned for semantic similarity, so you most likely want to use our other model. The main purpose of this model is to serve as a baseline against which the finetuned model can be compared.

What does the model architecture look like?

The model architecture is inspired by jTrans (Wang et al., 2022). It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al.

What is the output of the model?

The model returns a vector of 768 dimensions for each function that it's given. These vectors can be compared to get an indication of which functions are similar to each other.
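
As an illustration of how such vectors might be compared, the sketch below ranks candidate functions by cosine similarity to a query. Cosine similarity is an assumption here (this card does not state which distance measure is used), and the 4-dimensional toy vectors and the names `func_a`, `func_b` and `func_c` are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy 4-dimensional vectors standing in for the model's 768-dimensional output.
# All names and values are hypothetical.
query = [1.0, 0.0, 1.0, 0.0]
candidates = {
    "func_a": [0.9, 0.1, 1.1, 0.0],
    "func_b": [0.0, 1.0, 0.0, 1.0],
    "func_c": [1.0, 1.0, 1.0, 1.0],
}
ranking = rank_by_similarity(query, candidates)
```

Here `ranking` lists the candidates from most to least similar to the query; with real embeddings the same ranking step would be applied to the vectors the model returns.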

How does the model perform?

The model has been evaluated on Mean Reciprocal Rank (MRR) and Recall@1. When the model has to pick the positive example out of a pool of 32 functions, it ranks it first in 72% of cases (Recall@1 of 0.72). When the pool is enlarged to 10,000 functions, it still ranks the positive example first in more than half of the cases (Recall@1 of 0.56).

Model	Pool size	MRR	Recall@1
ASMBert	32	0.78	0.72
ASMBert	10,000	0.58	0.56
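
For readers unfamiliar with the two metrics, the following minimal sketch shows how MRR and Recall@1 are commonly computed from ranked retrieval results. It is not the evaluation code used for the numbers above; the function names and the toy queries are hypothetical.

```python
def reciprocal_rank(ranked_ids, positive_id):
    """Return 1 / rank of the positive example (ranks start at 1), or 0.0 if absent."""
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate == positive_id:
            return 1.0 / rank
    return 0.0


def evaluate(queries):
    """Compute (MRR, Recall@1) over a list of (ranked_ids, positive_id) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    rr = [reciprocal_rank(ranked, positive) for ranked, positive in queries]
    mrr = sum(rr) / len(rr)
    # Recall@1 counts the queries whose positive example is ranked first.
    recall_at_1 = sum(1 for value in rr if value == 1.0) / len(rr)
    return mrr, recall_at_1


# Three toy queries: the positive example is ranked 1st, 2nd and 4th respectively.
queries = [
    (["f1", "f7", "f3", "f9"], "f1"),
    (["f4", "f2", "f8", "f5"], "f2"),
    (["f6", "f0", "f3", "f5"], "f5"),
]
mrr, recall_at_1 = evaluate(queries)
```

In this toy example the reciprocal ranks are 1, 1/2 and 1/4, so the MRR is their mean and Recall@1 is one in three.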

Purpose and use of the model

For which problem has the model been designed?

The model has been designed to act as a base model for ARM64 assembly code.

What else could the model be used for?

The model can also be used to find similar ARM64 functions in a database of known ARM64 functions.

To what problems is the model not applicable?

Although the model performs reasonably well on the semantic search task, this model has NOT been finetuned on that task. For a finetuned ARM64-BERT model, please refer to the other model we have published.

Data

What data was used for training and evaluation?

The dataset is created in the same way as Wang et al. created BinaryCorp. A large set of binary code comes from the Arch Linux official repositories and the Arch User Repository (AUR). All this code is split into functions, which are compiled with different optimisation levels (O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5*2) different versions of each function, which are semantically similar, i.e. they represent the same functionality but differ in their assembly code. The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of either the train or the test set, not both. We have not performed any deduplication on the dataset for training.

set	# functions
train	18,083,285
test	3,375,741
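
A project-level split of the kind described above can be sketched as follows. This is only an illustration: the card does not say how projects were assigned to a set, so the hash-based assignment, the `test_fraction` value and the project names are all assumptions.

```python
import hashlib


def assign_split(project, test_fraction=0.15):
    """Deterministically assign a whole project to 'train' or 'test'.

    Hashing the project name is a hypothetical choice; it guarantees that
    every function of a project lands in the same set.
    """
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_functions(records, test_fraction=0.15):
    """Split (project, function_name) records so that no project spans both sets."""
    splits = {"train": [], "test": []}
    for project, function_name in records:
        splits[assign_split(project, test_fraction)].append((project, function_name))
    return splits


# Hypothetical records: (project, function name).
records = [
    ("project_a", "parse_header"),
    ("project_a", "read_block"),
    ("project_b", "parse_header"),
    ("project_c", "main"),
]
splits = split_functions(records)
train_projects = {project for project, _ in splits["train"]}
test_projects = {project for project, _ in splits["test"]}
```

Because the assignment depends only on the project name, `train_projects` and `test_projects` never overlap, which is the property the split is meant to guarantee.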

By whom was the dataset collected and annotated?

The dataset was collected by our team. The annotation of similar/non-similar functions comes from the different compilation settings, i.e. what we consider "similar functions" are in fact the same function compiled in different ways.
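
To make this labelling concrete, the sketch below enumerates the compilation configurations and derives positive pairs by grouping variants of the same function. It assumes the five optimisation levels are O0, O1, O2, O3 and Os, as in BinaryCorp; the tuple layout and the project and function names are hypothetical.

```python
from itertools import combinations, product

# Assumed optimisation levels (as in BinaryCorp) and security settings.
OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os"]
SECURITY = ["fortify", "no-fortify"]

# Every combination of optimisation level and security setting: 5 * 2 = 10.
CONFIGS = list(product(OPT_LEVELS, SECURITY))


def positive_pairs(variants):
    """Pair up compiled variants that stem from the same source function.

    `variants` is a list of (project, function_name, config) tuples. Two
    variants form a positive pair when they share project and function name.
    """
    by_function = {}
    for project, function_name, config in variants:
        by_function.setdefault((project, function_name), []).append(config)
    pairs = []
    for key, configs in by_function.items():
        for left, right in combinations(sorted(configs), 2):
            pairs.append((key, left, right))
    return pairs


# Hypothetical example: one function compiled in all 10 ways, one in a single way.
variants = [("project_a", "parse_header", config) for config in CONFIGS]
variants.append(("project_a", "read_block", ("O0", "no-fortify")))
pairs = positive_pairs(variants)
```

A function that exists in all 10 configurations yields 45 positive pairs, while a function with a single variant yields none.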

Any remarks on data quality and bias?

The way we classify functions as similar may have implications. For example, two different ways of compiling the same function sometimes do not result in different code. We did not remove such duplicates from the training data, but we did implement checks in the evaluation stage, and the model does not appear to have suffered from these trivial training examples.
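
One simple way to spot such identical variants is to fingerprint each function's instruction sequence, as in the sketch below. This is not a description of the checks we implemented; the hashing approach, the variant names and the example instructions are assumptions made for illustration.

```python
import hashlib


def fingerprint(instructions):
    """Hash a function's instruction sequence after trimming whitespace."""
    normalised = "\n".join(line.strip() for line in instructions)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_identical_variants(functions):
    """Return groups of variant names whose instructions are identical after trimming."""
    groups = {}
    for name, instructions in functions.items():
        groups.setdefault(fingerprint(instructions), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical variants of one small function at three optimisation levels.
functions = {
    "add_O2": ["add w0, w0, w1", "ret"],
    "add_O3": ["add w0, w0, w1", "ret"],
    "add_O0": [
        "sub sp, sp, #16",
        "str w0, [sp, #12]",
        "str w1, [sp, #8]",
        "ldr w8, [sp, #12]",
        "ldr w9, [sp, #8]",
        "add w0, w8, w9",
        "add sp, sp, #16",
        "ret",
    ],
}
duplicates = find_identical_variants(functions)
```

In this example the O2 and O3 variants are grouped as duplicates, while the O0 variant stays on its own.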

After training this base model, we found out that something had gone wrong when compiling our dataset. As a result, each function also includes the last instruction of the preceding function. Because training takes a long time and the model performs well despite the mistake, we have decided not to retrain our model.

Fairness Metrics

Which metrics have been used to measure bias in the data/model and why?

n.a.

What do those metrics show?

n.a.

Any other notable issues?

n.a.

Analyses (optional)

n.a.