# NexForge Tokenizer Examples
This directory contains example scripts demonstrating basic and advanced usage of the NexForge tokenizer.
## Quick Start
### Basic Tokenizer Creation
```python
from nexforgetokenizer import build_tokenizer
# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
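`build_tokenizer` writes the trained tokenizer to `output_path`. Here is a minimal round-trip sketch, assuming the output is a standard Hugging Face `tokenizers` JSON file (suggested by the `.json` extension, but not confirmed by this README):

```python
from tokenizers import Tokenizer  # pip install tokenizers

# Assumption: build_tokenizer emits a Hugging Face `tokenizers` JSON file
tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Encode a string into tokens and IDs, then decode back to text
encoding = tokenizer.encode("Hello, NexForge!")
print(encoding.tokens)  # the token strings
print(encoding.ids)     # the corresponding vocabulary IDs
print(tokenizer.decode(encoding.ids))
```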
### Example Scripts
1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection (see the sketch after this list)
2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing (also sketched after this list)
   - Performance optimization
   - Error handling
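The bundled scripts are the authoritative versions; as a rough sketch of what they cover (again assuming the output is a Hugging Face `tokenizers` JSON file):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Vocabulary inspection (basic_usage.py)
print(tokenizer.get_vocab_size())         # total number of tokens
vocab = tokenizer.get_vocab()             # dict mapping token -> id
print(sorted(vocab, key=vocab.get)[:10])  # the ten lowest-id tokens

# Batch processing with simple error handling (advanced_usage.py)
texts = ["def main():", "print('hello')", ""]
try:
    encodings = tokenizer.encode_batch(texts)  # encode many strings at once
    for enc in encodings:
        print(len(enc.ids), "tokens")
except Exception as exc:
    print(f"Encoding failed: {exc}")
```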
## Running Examples
```bash
# Install in development mode
pip install -e .
# Run basic example
python examples/basic_usage.py
# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer
```python
from nexforgetokenizer import build_tokenizer
# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,   # smaller vocabulary for a specific domain
    min_frequency=3,    # only include tokens appearing at least 3 times
    max_files=1000,     # limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
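To check that the special tokens were actually registered, you can look up their IDs after building. A sketch, again assuming the output is a Hugging Face `tokenizers` JSON file:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")

# token_to_id returns the vocabulary ID, or None if the token is missing
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, "->", tokenizer.token_to_id(token))
```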
## Best Practices
1. **For General Use**
   - Use the default settings (40k vocab, `min_frequency=2`)
   - Process all files in your dataset
   - Test with the built-in test suite
2. **For Specialized Domains** (see the sketch after this list)
   - Adjust vocabulary size to the complexity of the domain
   - Consider raising `min_frequency` when targeting a smaller vocabulary
   - Test with domain-specific files
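A sketch contrasting the two profiles; the directory paths and exact numbers below are illustrative placeholders, not values prescribed by the library:

```python
from nexforgetokenizer import build_tokenizer

# General use: defaults, broad coverage (path is a placeholder)
build_tokenizer(
    input_dir="data/mixed_corpus",
    output_path="general_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)

# Specialized domain: smaller vocabulary; a higher min_frequency
# drops rare tokens that add noise in a narrow corpus
build_tokenizer(
    input_dir="data/python_only",
    output_path="code_tokenizer.json",
    vocab_size=20000,
    min_frequency=3
)
```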
## Need Help?
- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support
## License
MIT License - See [LICENSE](../LICENSE) for details.