Enter a valid Hugging Face model ID (e.g., "mistralai/Mistral-7B-Instruct-v0.3").
The model must have a tokenizer available and must not be gated or restricted.
If a model is gated, you can use a mirrored version (for example, one published by unsloth) to work around the restriction.
For instance, use "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" instead of the original path.
Loaded
{{ error }}
Ctrl+Enter
Token Visualization
0/0
Note: Showing a preview of the first 8096 characters. Stats are calculated on the full file.
{% if token_data %}
{% for token in token_data.tokens %}
{{ token.display }}
{% if token.newline %}<br>{% endif %}
{% endfor %}
{% endif %}
Note: Showing only the first 50,000 tokens. Total token count: 0
Top Token Frequencies
Total Tokens
{{ token_data.stats.basic_stats.total_tokens if token_data else 0 }}
{{ token_data.stats.basic_stats.unique_tokens if token_data else 0 }} unique
({{ token_data.stats.basic_stats.unique_percentage if token_data else 0 }}%)
Token Types
{{ token_data.stats.basic_stats.special_tokens if token_data else 0 }}
special tokens
Whitespace
{{ token_data.stats.basic_stats.space_tokens if token_data else 0 }}
spaces: {{ token_data.stats.basic_stats.space_tokens if token_data else 0 }},
newlines: {{ token_data.stats.basic_stats.newline_tokens if token_data else 0 }}
Token Length
{{ token_data.stats.length_stats.avg_length if token_data else 0 }}
median: {{ token_data.stats.length_stats.median_length if token_data else 0 }},
±{{ token_data.stats.length_stats.std_dev if token_data else 0 }} std
Compression
{{ token_data.stats.basic_stats.compression_ratio if token_data else 0 }}