# NexForge Tokenizer Examples

This directory contains example scripts demonstrating basic and advanced usage of the NexForge tokenizer.

## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
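
Once the tokenizer file has been written, you can load it and run a quick encode/decode round trip. The snippet below assumes the output JSON follows the Hugging Face `tokenizers` format (an assumption, not confirmed by this README) and that the `tokenizers` package is installed:

```python
from tokenizers import Tokenizer

# Assumption: custom_tokenizer.json uses the Hugging Face `tokenizers` JSON format
tokenizer = Tokenizer.from_file("custom_tokenizer.json")

encoding = tokenizer.encode("def hello_world():")
print(encoding.tokens)                 # token strings
print(encoding.ids)                    # integer token ids
print(tokenizer.decode(encoding.ids))  # round trip back to text
```
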
### Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection (see the sketch below)

2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing
   - Performance optimization
   - Error handling
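
Vocabulary inspection along the lines of what `basic_usage.py` covers might look like the following. As above, this assumes the generated file is in the Hugging Face `tokenizers` JSON format:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Vocabulary inspection: overall size, then the ten lowest-id entries
print(tokenizer.get_vocab_size())
vocab = tokenizer.get_vocab()  # dict mapping token string -> id
for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1])[:10]:
    print(token_id, token)
```
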
## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```

## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,   # Smaller vocabulary for a specific domain
    min_frequency=3,    # Only include tokens appearing at least 3 times
    max_files=1000,     # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
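
Because `build_tokenizer` reads from the filesystem, it is worth validating inputs before a long build and handling failures explicitly, as `advanced_usage.py` does for error handling. A minimal sketch; the specific exception types NexForge raises are not documented here, so the broad `except Exception` is an assumption:

```python
from pathlib import Path

from nexforgetokenizer import build_tokenizer

input_dir = Path("../Dataset")
if not input_dir.is_dir():
    raise SystemExit(f"Input directory not found: {input_dir}")

try:
    build_tokenizer(
        input_dir=str(input_dir),
        output_path="my_tokenizer.json",
        vocab_size=30000,
        min_frequency=3,
    )
except Exception as exc:  # Assumption: NexForge's exception types are undocumented here
    raise SystemExit(f"Tokenizer build failed: {exc}")
```
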
## Best Practices

1. **For General Use**
   - Use the default settings (40k vocabulary, `min_frequency=2`)
   - Process all files in your dataset
   - Test with the built-in test suite

2. **For Specialized Domains**
   - Adjust the vocabulary size to match domain complexity
   - Increase `min_frequency` when you want a smaller, higher-signal vocabulary
   - Test with domain-specific files

## Need Help?

- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - See [LICENSE](../LICENSE) for details.