# NexForge Tokenizer Examples
This directory contains example scripts demonstrating usage of the NexForge tokenizer, from basic setup to advanced configuration.
## Quick Start
### Basic Tokenizer Creation
```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
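
To verify the result, the sketch below loads the generated file and runs a quick encode/decode round trip. It assumes the JSON output is compatible with the Hugging Face `tokenizers` format; check `basic_usage.py` for the project's actual loading API.

```python
from tokenizers import Tokenizer  # assumption: HF-tokenizers-compatible JSON output

# Load the tokenizer built in the snippet above
tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Encode a sample string and inspect the result
encoding = tokenizer.encode("def hello_world():")
print(encoding.tokens)  # token strings
print(encoding.ids)     # corresponding vocabulary ids

# Decode back to text (round trip)
print(tokenizer.decode(encoding.ids))
```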
### Example Scripts
1. **Basic Example** (`basic_usage.py`)
- Simple tokenizer creation and usage
- Basic encoding/decoding
- Vocabulary inspection
2. **Advanced Usage** (`advanced_usage.py`)
- Custom special tokens
- Batch processing (see the sketch after this list)
- Performance optimization
- Error handling
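
As a rough illustration of the batch-processing idea from `advanced_usage.py`, here is a minimal sketch, again assuming the saved JSON loads with the Hugging Face `tokenizers` library and using a hypothetical file path:

```python
from tokenizers import Tokenizer  # assumption: HF-tokenizers-compatible JSON output

tokenizer = Tokenizer.from_file("my_tokenizer.json")  # hypothetical path

texts = ["first example", "second example", "third example"]

# encode_batch processes all inputs in one call, which is typically
# much faster than looping over encode() for large corpora
encodings = tokenizer.encode_batch(texts)
for enc in encodings:
    print(len(enc.ids), enc.tokens[:5])
```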
## Running Examples
```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer
```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,   # Smaller vocabulary for specific domain
    min_frequency=3,    # Only include tokens appearing at least 3 times
    max_files=1000,     # Limit number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
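
After building with custom special tokens, it is worth confirming they were actually registered in the vocabulary. A minimal check, assuming the JSON output loads with the Hugging Face `tokenizers` library:

```python
from tokenizers import Tokenizer  # assumption: HF-tokenizers-compatible JSON output

tokenizer = Tokenizer.from_file("my_tokenizer.json")

# token_to_id returns None if a token is missing from the vocabulary
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, "->", tokenizer.token_to_id(token))
```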
## Best Practices
1. **For General Use**
- Use the default settings (`vocab_size=40000`, `min_frequency=2`)
- Process all files in your dataset
- Test with the built-in test suite
2. **For Specialized Domains**
- Adjust vocabulary size based on domain complexity
- Consider increasing min_frequency for smaller vocabularies
- Test with domain-specific files (see the sketch after this list)
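
One concrete way to sanity-check a `vocab_size` / `min_frequency` choice for a specialized domain is to measure how often held-out text falls back to `[UNK]`. A minimal sketch, under the same Hugging Face `tokenizers` compatibility assumption and with a hypothetical sample file `held_out.txt`:

```python
from tokenizers import Tokenizer  # assumption: HF-tokenizers-compatible JSON output

tokenizer = Tokenizer.from_file("my_tokenizer.json")
unk_id = tokenizer.token_to_id("[UNK]")

# held_out.txt is a hypothetical domain-specific sample file
with open("held_out.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids

# A high unknown-token rate suggests the vocabulary is too small
# or min_frequency is too aggressive for this domain
unk_rate = ids.count(unk_id) / max(len(ids), 1)
print(f"unknown-token rate: {unk_rate:.2%}")
```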
## Need Help?
- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support
## License
MIT License - See [LICENSE](../LICENSE) for details.