---
language:
  - en
tags:
  - charboundary
  - sentence-boundary-detection
  - paragraph-detection
  - legal-text
  - legal-nlp
  - text-segmentation
  - onnx
  - cpu
  - document-processing
  - rag
  - optimized-inference
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
  - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
  - alea-institute/kl3m-data-snapshot-20250324
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - throughput
papers:
  - https://arxiv.org/abs/2504.04131
---

# CharBoundary medium (default) ONNX Model

This is the medium (default) ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: medium (default)
- **Model Size**: 2.6 MB (ONNX compressed)
- **Memory Usage**: 1897 MB at runtime (non-ONNX version)
- **Training Data**: Legal text with ~500,000 samples from [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (64 trees, max depth 20) converted to ONNX
- **Format**: ONNX optimized for inference
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~587K characters/second (base model; ONNX is typically 2-4x faster)

## Usage

> **Security Advantage:** This ONNX model format provides enhanced security compared to SKOPS models, as it doesn't require bypassing security measures with `trust_model=True`. ONNX models are the recommended option for security-sensitive environments.

```python
# Install with the onnx extra to get ONNX Runtime support:
# pip install charboundary[onnx]
from charboundary import get_medium_onnx_segmenter

# The first load can be slow
segmenter = get_medium_onnx_segmenter()

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
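The sentences and spans in the example describe the same segmentation: each span is a `(start, end)` character offset pair into the original text. The sketch below is a minimal illustration of that relationship and does not call the library. It assumes, as in the output above, that the returned sentences are contiguous and concatenate back to the input; `spans_from_segments` is a hypothetical helper, not part of CharBoundary.

```python
from itertools import accumulate


def spans_from_segments(segments):
    """Derive (start, end) character spans from contiguous text segments.

    Hypothetical helper for illustration; assumes the segments concatenate
    back to the original text with nothing dropped between them.
    """
    ends = list(accumulate(len(s) for s in segments))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


text = "This is a test sentence. Here's another one!"
# Sentences as shown in the usage example output
sentences = ["This is a test sentence.", " Here's another one!"]

spans = spans_from_segments(sentences)
print(spans)
# Output: [(0, 24), (24, 44)]

# Slicing the original text with each span recovers the sentence
recovered = [text[start:end] for start, end in spans]
print(recovered == sentences)
# Output: True
```

Working with spans rather than strings is useful when downstream code needs to map sentences back to positions in the source document, for example for highlighting or citation in a retrieval pipeline.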

## Performance

ONNX models typically run inference 2-4x faster than the standard scikit-learn models
while maintaining the same accuracy metrics. The tables below report accuracy for the base model and compare size, memory usage, and throughput across model sizes.

### Base Model Performance 

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.631 | 0.722 | 0.842 |
| SCOTUS | 0.938 | 0.775 | 0.661 |
| Cyber Crime | 0.961 | 0.853 | 0.767 |
| BVA | 0.957 | 0.875 | 0.806 |
| Intellectual Property | 0.948 | 0.889 | 0.837 |
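As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the rounded precision and recall values listed above. Because the inputs are already rounded to three decimals, the recomputed values can differ from the reported F1 in the last digit; the tolerance used here is an assumption chosen for that reason.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (dataset, precision, reported F1, recall), copied from the table above
rows = [
    ("ALEA SBD Benchmark", 0.631, 0.722, 0.842),
    ("SCOTUS", 0.938, 0.775, 0.661),
    ("Cyber Crime", 0.961, 0.853, 0.767),
    ("BVA", 0.957, 0.875, 0.806),
    ("Intellectual Property", 0.948, 0.889, 0.837),
]

TOLERANCE = 0.0015  # assumed allowance for inputs rounded to three decimals

for name, precision, reported_f1, recall in rows:
    computed = f1_score(precision, recall)
    consistent = abs(computed - reported_f1) < TOLERANCE
    print(f"{name}: computed F1 = {computed:.3f}, reported = {reported_f1:.3f}, consistent = {consistent}")
```

The table also shows the trade-off the model makes on each dataset: precision is above 0.93 on every dataset except the ALEA SBD Benchmark, where recall is higher than precision.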

### Size and Speed Comparison

| Model | Format | Size in MB (SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
|-------|--------|-----------|--------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
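To turn the throughput figures into a rough planning number, the sketch below estimates wall-clock time for a corpus of a given size. It is a back-of-the-envelope calculation only: it assumes single-process, constant throughput at the base-model rates from the table, and the optional `speedup` factor is an assumption standing in for the "typically 2-4x" ONNX gain mentioned in Model Details. Actual throughput depends on hardware and text.

```python
# Approximate base-model throughput in characters per second, from the table above
THROUGHPUT = {
    "small": 748_000,
    "medium": 587_000,
    "large": 518_000,
}


def estimated_seconds(num_chars, model="medium", speedup=1.0):
    """Estimate processing time in seconds, assuming constant throughput.

    `speedup` is a hypothetical multiplier (e.g. 2.0 to 4.0 for ONNX).
    """
    return num_chars / (THROUGHPUT[model] * speedup)


corpus_chars = 1_000_000_000  # an illustrative corpus of one billion characters

for model in THROUGHPUT:
    base = estimated_seconds(corpus_chars, model)
    fast = estimated_seconds(corpus_chars, model, speedup=4.0)
    print(f"{model}: ~{base / 60:.1f} min at base rate, ~{fast / 60:.1f} min at an assumed 4x speedup")
```

At the medium model's base rate, one billion characters works out to roughly 28 minutes under these assumptions.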

## Paper and Citation

This model is part of the research presented in the following paper:

```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:
- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)


## Contact

This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).

For technical support, collaboration opportunities, or general inquiries, contact the ALEA Institute or create an issue on this repository:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai)
- Website: https://aleainstitute.ai

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)