---
language:
- en
tags:
- charboundary
- sentence-boundary-detection
- paragraph-detection
- legal-text
- legal-nlp
- text-segmentation
- onnx
- cpu
- document-processing
- rag
- optimized-inference
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy
- f1
- precision
- recall
- throughput
papers:
- https://arxiv.org/abs/2504.04131
---

# CharBoundary large ONNX Model

This is the large ONNX model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: large
- **Model Size**: 12.0 MB (ONNX compressed)
- **Memory Usage**: 5,734 MB at runtime (non-ONNX version)
- **Training Data**: Legal text with ~5,000,000 samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (100 trees, max depth 24) converted to ONNX
- **Format**: ONNX optimized for inference
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~518K characters/second (base model; ONNX is typically 2-4x faster)
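
The usual way to use this model is through the `charboundary` loaders shown in the Usage section below. If you also want to inspect the exported ONNX graph directly (input/output names, shapes, and types), a minimal sketch along the following lines should work; it assumes the ONNX file published in this repository is named `model.onnx`, so check the repository file list for the actual filename.

```python
# Illustrative only: download the ONNX file from this repository and inspect it
# with onnxruntime. The filename "model.onnx" is an assumption; verify it against
# the files actually published here.
from huggingface_hub import hf_hub_download
import onnxruntime as ort

model_path = hf_hub_download(
    repo_id="alea-institute/charboundary-large-onnx",
    filename="model.onnx",  # hypothetical filename
)

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```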

## Usage

> **Security Advantage:** Unlike the SKOPS models, this ONNX model can be loaded without bypassing security measures via `trust_model=True`, which makes it the recommended option for security-sensitive environments.

```python
# Install with the onnx extra to get ONNX runtime support:
# pip install charboundary[onnx]
from charboundary import get_large_onnx_segmenter

# The first load can be slow
segmenter = get_large_onnx_segmenter()

# Segment text into sentences
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Get character-level sentence spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
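
Because the spans are plain character offsets into the original string, they can be fed straight into downstream chunking logic (for example, in a RAG pipeline). The helper below is only an illustrative sketch that continues from the snippet above and relies solely on the `get_sentence_spans` call already shown; the function name and the `max_chars` parameter are not part of the charboundary API.

```python
# Illustrative sketch: pack consecutive sentence spans into chunks of at most
# `max_chars` characters for retrieval-style indexing. Only get_sentence_spans
# (used above) is a charboundary call; everything else is plain Python.
def spans_to_chunks(text: str, spans: list[tuple[int, int]], max_chars: int = 1000) -> list[str]:
    chunks, start, end = [], None, None
    for s, e in spans:
        if start is None:
            start, end = s, e
        elif e - start <= max_chars:
            end = e  # extend the current chunk with this sentence
        else:
            chunks.append(text[start:end].strip())
            start, end = s, e
    if start is not None:
        chunks.append(text[start:end].strip())
    return chunks

chunks = spans_to_chunks(text, segmenter.get_sentence_spans(text), max_chars=200)
print(chunks)
```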

## Performance

ONNX models provide significantly faster inference compared to the standard scikit-learn models while maintaining the same accuracy metrics. The performance differences between model sizes are shown below.

### Base Model Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.637 | 0.727 | 0.847 |
| SCOTUS | 0.950 | 0.778 | 0.658 |
| Cyber Crime | 0.968 | 0.853 | 0.762 |
| BVA | 0.963 | 0.881 | 0.813 |
| Intellectual Property | 0.954 | 0.890 | 0.834 |
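
Precision, recall, and F1 are related by the usual harmonic mean, so any row can be cross-checked quickly (the reported figures are rounded, so small rounding differences are possible on some rows). For example, for the ALEA SBD Benchmark row:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
# Values taken from the ALEA SBD Benchmark row above.
precision, recall = 0.637, 0.847
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727
```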

### Size and Speed Comparison

| Model | Format | Size (MB, SKOPS / ONNX) | Memory Usage | Throughput (chars/sec) | F1 Score |
|-------|--------|-------------------------|--------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 MB | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 MB | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 MB | ~518K | 0.782 |
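
Throughput depends heavily on hardware and on the text being segmented, so treat the numbers above as indicative. A rough way to measure characters/second on your own machine, reusing only the `get_large_onnx_segmenter` API shown in the Usage section, is sketched below; the sample file path is a placeholder.

```python
# Rough, illustrative throughput measurement (chars/sec) on local hardware.
# Replace "sample_legal_text.txt" with any reasonably large document.
import time

from charboundary import get_large_onnx_segmenter

segmenter = get_large_onnx_segmenter()

with open("sample_legal_text.txt", encoding="utf-8") as f:
    text = f.read()

start = time.perf_counter()
sentences = segmenter.segment_to_sentences(text)
elapsed = time.perf_counter() - start

print(f"{len(sentences)} sentences, {len(text) / elapsed:,.0f} chars/sec")
```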

## Paper and Citation

This model is part of the research presented in the following paper:

```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:

- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)

## Contact

This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai).

For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: hello@aleainstitute.ai
- Website: https://aleainstitute.ai

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai), or open an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-model-research).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
|