---
language:
- en
tags:
- charboundary
- sentence-boundary-detection
- paragraph-detection
- legal-text
- legal-nlp
- text-segmentation
- onnx
- cpu
- document-processing
- rag
- optimized-inference
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy
- f1
- precision
- recall
- throughput
papers:
- https://arxiv.org/abs/2504.04131
---

# CharBoundary large ONNX Model

This is the large ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: large
- **Model Size**: 12.0 MB (ONNX compressed)
- **Memory Usage**: 5,734 MB at runtime (non-ONNX version)
- **Training Data**: Legal text with ~5,000,000 samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (100 trees, max depth 24) converted to ONNX
- **Format**: ONNX optimized for inference
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~518K characters/second (base model; ONNX is typically 2-4x faster)
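
The usual way to use this model is through the `charboundary` loaders shown in the Usage section below. If you also want to inspect the exported ONNX graph directly (input/output names, shapes, and types), a minimal sketch along the following lines should work; it assumes the ONNX file published in this repository is named `model.onnx`, so check the repository file list for the actual filename.

```python
# Illustrative only: download the ONNX file from this repository and inspect it
# with onnxruntime. The filename "model.onnx" is an assumption; verify it against
# the files actually published here.
from huggingface_hub import hf_hub_download
import onnxruntime as ort

model_path = hf_hub_download(
    repo_id="alea-institute/charboundary-large-onnx",
    filename="model.onnx",  # hypothetical filename
)

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```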

## Usage

> **Security Advantage:** Unlike the SKOPS models, this ONNX model can be loaded without bypassing security measures via `trust_model=True`, which makes it the recommended option for security-sensitive environments.

```python
# Install with the onnx extra to get ONNX runtime support:
# pip install charboundary[onnx]
from charboundary import get_large_onnx_segmenter

# The first load can be slow
segmenter = get_large_onnx_segmenter()

# Segment text into sentences
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Get character-level sentence spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
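
Because the spans are plain character offsets into the original string, they can be fed straight into downstream chunking logic (for example, in a RAG pipeline). The helper below is only an illustrative sketch that continues from the snippet above and relies solely on the `get_sentence_spans` call already shown; the function name and the `max_chars` parameter are not part of the charboundary API.

```python
# Illustrative sketch: pack consecutive sentence spans into chunks of at most
# `max_chars` characters for retrieval-style indexing. Only get_sentence_spans
# (used above) is a charboundary call; everything else is plain Python.
def spans_to_chunks(text: str, spans: list[tuple[int, int]], max_chars: int = 1000) -> list[str]:
    chunks, start, end = [], None, None
    for s, e in spans:
        if start is None:
            start, end = s, e
        elif e - start <= max_chars:
            end = e  # extend the current chunk with this sentence
        else:
            chunks.append(text[start:end].strip())
            start, end = s, e
    if start is not None:
        chunks.append(text[start:end].strip())
    return chunks

chunks = spans_to_chunks(text, segmenter.get_sentence_spans(text), max_chars=200)
print(chunks)
```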

## Performance

ONNX models provide significantly faster inference compared to the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

### Base Model Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.637 | 0.727 | 0.847 |
| SCOTUS | 0.950 | 0.778 | 0.658 |
| Cyber Crime | 0.968 | 0.853 | 0.762 |
| BVA | 0.963 | 0.881 | 0.813 |
| Intellectual Property | 0.954 | 0.890 | 0.834 |
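
Precision, recall, and F1 are related by the usual harmonic mean, so any row can be cross-checked quickly (the reported figures are rounded, so small rounding differences are possible on some rows). For example, for the ALEA SBD Benchmark row:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
# Values taken from the ALEA SBD Benchmark row above.
precision, recall = 0.637, 0.847
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727
```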

### Size and Speed Comparison

| Model | Format | Size (MB, SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
|-------|--------|-------------------------|--------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
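
Throughput depends heavily on hardware and on the text being segmented, so treat the numbers above as indicative. A rough way to measure characters/second on your own machine, reusing only the `get_large_onnx_segmenter` API shown in the Usage section, is sketched below; the sample file path is a placeholder.

```python
# Rough, illustrative throughput measurement (chars/sec) on local hardware.
# Replace "sample_legal_text.txt" with any reasonably large document.
import time

from charboundary import get_large_onnx_segmenter

segmenter = get_large_onnx_segmenter()

with open("sample_legal_text.txt", encoding="utf-8") as f:
    text = f.read()

start = time.perf_counter()
sentences = segmenter.segment_to_sentences(text)
elapsed = time.perf_counter() - start

print(f"{len(sentences)} sentences, {len(text) / elapsed:,.0f} chars/sec")
```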

## Paper and Citation

This model is part of the research presented in the following paper:

```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:

- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)

## Contact

This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).

For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: hello@aleainstitute.ai
- Website: https://aleainstitute.ai

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai), or open an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-model-research).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
|