# NexForge Tokenizer Examples

This directory contains example scripts demonstrating advanced usage of the NexForge tokenizer.
## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
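Once built, the tokenizer is written to the JSON file given by `output_path`. As a quick sanity check you can load it and round-trip some text. The sketch below assumes the generated JSON is compatible with the Hugging Face `tokenizers` file format; that is an assumption for illustration, not a documented guarantee of NexForge:

```python
# Minimal round-trip check; assumes the generated JSON follows the
# Hugging Face `tokenizers` file format (pip install tokenizers).
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

encoding = tokenizer.encode("def hello_world():\n    print('Hello, NexForge!')")
print(encoding.tokens)                 # the string tokens
print(encoding.ids)                    # their vocabulary ids
print(tokenizer.decode(encoding.ids))  # round-trip back to text
```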
## Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection

2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing (see the sketch after this list)
   - Performance optimization
   - Error handling
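The batch-processing and error-handling patterns in `advanced_usage.py` follow roughly this shape. The sketch below is illustrative only: it assumes the built tokenizer JSON can be loaded with the Hugging Face `tokenizers` library, and the directory, glob pattern, and batch size are placeholders rather than values used by the script:

```python
# Illustrative batch-processing sketch (not the actual advanced_usage.py).
# Assumes a tokenizer JSON compatible with the Hugging Face `tokenizers` library.
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
files = sorted(Path("../Dataset").rglob("*.txt"))  # placeholder glob pattern

BATCH_SIZE = 64
for start in range(0, len(files), BATCH_SIZE):
    texts = []
    for path in files[start:start + BATCH_SIZE]:
        try:
            texts.append(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, OSError) as exc:
            # Skip unreadable files instead of aborting the whole run
            print(f"Skipping {path}: {exc}")
    encodings = tokenizer.encode_batch(texts)  # tokenize the whole batch at once
    total_tokens = sum(len(e.ids) for e in encodings)
    print(f"Batch {start // BATCH_SIZE}: {len(texts)} files, {total_tokens} tokens")
```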
## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,      # Smaller vocabulary for a specific domain
    min_frequency=3,       # Only include tokens appearing at least 3 times
    max_files=1000,        # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
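After building, it is worth confirming that the special tokens were registered and that the vocabulary landed at or below the requested size. A small check, again under the assumption that the JSON can be read with the Hugging Face `tokenizers` library:

```python
# Sanity-check the custom tokenizer; assumes Hugging Face `tokenizers` JSON format.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")

print("Vocabulary size:", tokenizer.get_vocab_size())  # expected to be <= 30000

# Every special token should map to a valid id (None means it is missing)
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, "->", tokenizer.token_to_id(token))
```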
## Best Practices

### For General Use

- Use the default settings (40k vocabulary, min_frequency=2)
- Process all files in your dataset
- Test with the built-in test suite

### For Specialized Domains

- Adjust the vocabulary size based on domain complexity
- Consider increasing min_frequency if you want a smaller vocabulary
- Test with domain-specific files (see the sketch after this list)
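One practical way to test with domain-specific files is to measure how often the tokenizer falls back to the unknown token on held-out text; a high rate suggests the vocabulary size or min_frequency needs adjusting. A rough sketch, assuming the Hugging Face `tokenizers` format and an `[UNK]` special token (the directory name is a placeholder):

```python
# Rough unknown-token-rate check on held-out domain files (illustrative only).
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
unk_id = tokenizer.token_to_id("[UNK]")  # assumes "[UNK]" was registered

total, unknown = 0, 0
for path in Path("holdout_domain_files").glob("*.txt"):  # placeholder directory
    ids = tokenizer.encode(path.read_text(encoding="utf-8")).ids
    total += len(ids)
    unknown += sum(1 for i in ids if i == unk_id)

if total:
    print(f"Unknown-token rate: {unknown / total:.2%} over {total} tokens")
```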
## Need Help?

- Check the main README for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - see LICENSE for details.