# NexForge Tokenizer Examples

This directory contains example scripts demonstrating advanced usage of the NexForge tokenizer.
## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
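Once built, the tokenizer is written to the JSON file given by `output_path`. As a quick sanity check you can load it and round-trip some text. The sketch below assumes the generated JSON is compatible with the Hugging Face `tokenizers` file format; that is an assumption for illustration, not a documented guarantee of NexForge:

```python
# Minimal round-trip check; assumes the generated JSON follows the
# Hugging Face `tokenizers` file format (pip install tokenizers).
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

encoding = tokenizer.encode("def hello_world():\n    print('Hello, NexForge!')")
print(encoding.tokens)                 # the string tokens
print(encoding.ids)                    # their vocabulary ids
print(tokenizer.decode(encoding.ids))  # round-trip back to text
```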
## Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection

2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing (see the sketch after this list)
   - Performance optimization
   - Error handling
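The batch-processing and error-handling patterns in `advanced_usage.py` follow roughly this shape. The sketch below is illustrative only: it assumes the built tokenizer JSON can be loaded with the Hugging Face `tokenizers` library, and the directory, glob pattern, and batch size are placeholders rather than values used by the script:

```python
# Illustrative batch-processing sketch (not the actual advanced_usage.py).
# Assumes a tokenizer JSON compatible with the Hugging Face `tokenizers` library.
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
files = sorted(Path("../Dataset").rglob("*.txt"))  # placeholder glob pattern

BATCH_SIZE = 64
for start in range(0, len(files), BATCH_SIZE):
    texts = []
    for path in files[start:start + BATCH_SIZE]:
        try:
            texts.append(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, OSError) as exc:
            # Skip unreadable files instead of aborting the whole run
            print(f"Skipping {path}: {exc}")
    encodings = tokenizer.encode_batch(texts)  # tokenize the whole batch at once
    total_tokens = sum(len(e.ids) for e in encodings)
    print(f"Batch {start // BATCH_SIZE}: {len(texts)} files, {total_tokens} tokens")
```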
## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,      # Smaller vocabulary for a specific domain
    min_frequency=3,       # Only include tokens appearing at least 3 times
    max_files=1000,        # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
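After building, it is worth confirming that the special tokens were registered and that the vocabulary landed at or below the requested size. A small check, again under the assumption that the JSON can be read with the Hugging Face `tokenizers` library:

```python
# Sanity-check the custom tokenizer; assumes Hugging Face `tokenizers` JSON format.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")

print("Vocabulary size:", tokenizer.get_vocab_size())  # expected to be <= 30000

# Every special token should map to a valid id (None means it is missing)
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, "->", tokenizer.token_to_id(token))
```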
## Best Practices

### For General Use

- Use the default settings (40k vocabulary, min_frequency=2)
- Process all files in your dataset
- Test with the built-in test suite

### For Specialized Domains

- Adjust the vocabulary size based on domain complexity
- Consider increasing min_frequency if you want a smaller vocabulary
- Test with domain-specific files (see the sketch after this list)
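One practical way to test with domain-specific files is to measure how often the tokenizer falls back to the unknown token on held-out text; a high rate suggests the vocabulary size or min_frequency needs adjusting. A rough sketch, assuming the Hugging Face `tokenizers` format and an `[UNK]` special token (the directory name is a placeholder):

```python
# Rough unknown-token-rate check on held-out domain files (illustrative only).
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
unk_id = tokenizer.token_to_id("[UNK]")  # assumes "[UNK]" was registered

total, unknown = 0, 0
for path in Path("holdout_domain_files").glob("*.txt"):  # placeholder directory
    ids = tokenizer.encode(path.read_text(encoding="utf-8")).ids
    total += len(ids)
    unknown += sum(1 for i in ids if i == unk_id)

if total:
    print(f"Unknown-token rate: {unknown / total:.2%} over {total} tokens")
```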
## Need Help?

- Check the main README for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - see LICENSE for details.