
# NexForge Tokenizer Examples

This directory contains example scripts that demonstrate how to use the NexForge tokenizer, from basic creation to more advanced workflows.

## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
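
Once the tokenizer is built, you can load it and run a quick round-trip check. A minimal sketch, assuming the generated `.json` file follows the standard Hugging Face `tokenizers` format (if NexForge ships its own loader, prefer that):

```python
# Minimal sketch: load and round-trip the generated tokenizer.
# Assumes custom_tokenizer.json is in Hugging Face tokenizers format;
# NexForge may provide its own loading helper instead.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Round-trip a sample string to sanity-check the vocabulary
encoding = tokenizer.encode("NexForge makes fast tokenizers.")
print(encoding.tokens)                 # subword tokens
print(encoding.ids)                    # integer token IDs
print(tokenizer.decode(encoding.ids))  # should closely match the input
```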

## Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection
2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing (see the sketch after this list)
   - Performance optimization
   - Error handling
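
For the batch processing and error handling patterns that `advanced_usage.py` covers, a rough sketch might look like the following. It again assumes the saved tokenizer is in the Hugging Face `tokenizers` format; the dataset path and file glob are placeholders:

```python
# Sketch: batch encoding with basic error handling.
# my_tokenizer.json and ../Dataset are placeholder paths.
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")

texts = []
for path in Path("../Dataset").glob("*.txt"):
    try:
        texts.append(path.read_text(encoding="utf-8"))
    except UnicodeDecodeError:
        print(f"Skipping undecodable file: {path}")

# encode_batch tokenizes all texts in one call, which is much faster
# than looping over encode() for large datasets
encodings = tokenizer.encode_batch(texts)
total_tokens = sum(len(e.ids) for e in encodings)
print(f"Encoded {len(texts)} files into {total_tokens} tokens")
```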

## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```

## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,       # Smaller vocabulary for a specific domain
    min_frequency=3,        # Only include tokens appearing at least 3 times
    max_files=1000,         # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
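
After building, it is worth confirming that the special tokens actually landed in the vocabulary. A minimal sketch, under the same assumption that the output is a Hugging Face `tokenizers`-format file:

```python
# Sketch: confirm special tokens exist in the built vocabulary
# (assumes my_tokenizer.json is in Hugging Face tokenizers format)
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    token_id = tokenizer.token_to_id(token)  # None if the token is absent
    print(f"{token!r} -> {token_id}")
```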

## Best Practices

1. **For General Use**
   - Use the default settings (40k vocabulary, `min_frequency=2`)
   - Process all files in your dataset
   - Test with the built-in test suite
2. **For Specialized Domains**
   - Adjust vocabulary size based on domain complexity
   - Consider increasing `min_frequency` for smaller vocabularies
   - Test with domain-specific files (see the sketch below)
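
One quick way to test domain fit is to measure how compactly the tokenizer encodes representative text: fewer tokens per character generally means the vocabulary matches the domain better. A minimal sketch, with a placeholder sample path and the same Hugging Face format assumption:

```python
# Sketch: rough domain-fit check via tokens-per-character ratio.
# Lower is better; a vocabulary tuned to the domain compresses its
# text into fewer tokens. Paths and file format are assumptions.
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")

sample = Path("../Dataset/sample.txt").read_text(encoding="utf-8")
n_tokens = len(tokenizer.encode(sample).ids)
ratio = n_tokens / max(len(sample), 1)
print(f"{n_tokens} tokens for {len(sample)} chars "
      f"({ratio:.3f} tokens/char)")
```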

## Need Help?

- Check the main README for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - See LICENSE for details.