---
license: mit
language:
- code
library_name: transformers
tags:
- text-classification
- code-classification
- vulnerability-detection
- automatic-vulnerability-detection
- secure-coding
---

# Vulnerability Detector for C Code (SARD)

This model is a fine-tuned version of `microsoft/codebert-base` that detects vulnerabilities in C functions.

## Model Description

This is a binary text-classification model that takes a C function as input and classifies it as either **Vulnerable** (`LABEL_1`) or **Safe** (`LABEL_0`).
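
Since the model reports only the raw label strings, a small lookup can make results easier to read. The following is a minimal sketch, assuming each prediction is a dictionary with `label` and `score` keys as in the "How to Use" section; `LABEL_NAMES` and `describe_prediction` are hypothetical helper names, not part of the model or of `transformers`.

```python
# Hypothetical helper (not shipped with the model): maps the raw labels
# to the names used in this model card.
LABEL_NAMES = {"LABEL_0": "Safe", "LABEL_1": "Vulnerable"}


def describe_prediction(prediction: dict) -> str:
    """Format one prediction such as {'label': 'LABEL_1', 'score': 0.99}."""
    # Fall back to the raw label if it is not one of the two known values.
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name} (score={prediction['score']:.2f})"


print(describe_prediction({"label": "LABEL_1", "score": 0.99}))
# prints: Vulnerable (score=0.99)
```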

The model was fine-tuned on the [NIST SARD (Software Assurance Reference Dataset)](https://samate.nist.gov/SARD/), focusing on common C weaknesses such as memory leaks, buffer overflows, and other CWEs covered by the Juliet Test Suite. Because SARD test cases are clean and consistently structured, the model reached very high accuracy on the validation set.

## Intended Uses & Limitations

This model is intended as a proof-of-concept tool to assist developers in identifying potentially vulnerable code patterns during the development lifecycle.

**Limitations:**
*   The model is highly specialized for the types of vulnerabilities found in the SARD dataset. Its performance on real-world, messy, or obfuscated code may be lower.
*   It should be used as an assistive tool, not as a replacement for comprehensive security audits or other static analysis tools.
*   The model classifies entire functions, so it does not pinpoint the exact line of code responsible for a vulnerability.
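
One way to treat predictions as review hints and not as verdicts is to flag every "vulnerable" result and also every "safe" result the model is unsure about. The sketch below is only an illustration: the `triage` function and the `0.9` threshold are assumptions, not values calibrated for this model.

```python
def triage(prediction: dict, threshold: float = 0.9) -> str:
    """Turn one prediction into a review hint.

    `threshold` is an illustrative assumption, not a calibrated value.
    """
    if prediction["label"] == "LABEL_1":
        # Any function predicted as vulnerable goes to a human reviewer.
        return "review: predicted vulnerable"
    if prediction["score"] < threshold:
        # A "safe" prediction with a low score is still worth a look.
        return "review: low-confidence safe prediction"
    return "no flag"


print(triage({"label": "LABEL_1", "score": 0.97}))  # review: predicted vulnerable
print(triage({"label": "LABEL_0", "score": 0.62}))  # review: low-confidence safe prediction
print(triage({"label": "LABEL_0", "score": 0.99}))  # no flag
```

A function that comes back with "no flag" has not been shown to be safe; it has only matched patterns the model learned from SARD.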

## How to Use

The model can be loaded with the `pipeline` API from the `transformers` library.

```python
from transformers import pipeline

# Load the classifier pipeline
classifier = pipeline("text-classification", model="jacpacd/vuln-detector-codebert-c-sard")

# Example of a vulnerable C function (Memory Leak)
vulnerable_code = """
void CWE401_Memory_Leak__strdup_char_01_bad()
{
    char * data;
    data = NULL;
    {
        char myString[] = "myString";
        /* POTENTIAL FLAW: Allocate memory from the heap */
        data = strdup(myString);
        printLine(data);
    }
    /* POTENTIAL FLAW: No deallocation of memory */
    ;
}
"""

# Example of a safe C function
safe_code = """
void CWE401_Memory_Leak__strdup_char_01_goodB2G()
{
    char * data;
    data = NULL;
    {
        char myString[] = "myString";
        data = strdup(myString);
        printLine(data);
    }
    /* FIX: Deallocate memory */
    free(data);
}
"""

results_vuln = classifier(vulnerable_code)
results_safe = classifier(safe_code)

print(f"Vulnerable Code Prediction: {results_vuln[0]}")
# Expected output: {'label': 'LABEL_1', 'score': 0.99...}

print(f"Safe Code Prediction: {results_safe[0]}")
# Expected output: {'label': 'LABEL_0', 'score': 0.99...}
```