# Tokenizer for Python Code (Trained on CodeSearchNet)
## Model Description
This is a custom Byte-Pair Encoding (BPE) tokenizer, initialized from the pre-trained `gpt2` tokenizer and further trained on the Python subset of the CodeSearchNet dataset. It is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
## Training Data
The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, focusing specifically on Python code examples. The training corpus consisted of 412,178 Python function strings.
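For reference, here is a minimal sketch of how this corpus can be loaded with the 🤗 `datasets` library. The `"python"` configuration name is an assumption based on the upstream CodeSearchNet dataset layout:

```python
from datasets import load_dataset

# Load the Python subset of CodeSearchNet (config name "python" is assumed)
dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Each row's `whole_func_string` column holds the full source of one function
print(len(dataset))  # ~412,178 examples
print(dataset[0]["whole_func_string"][:200])
```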
## Training Procedure
- **Base Tokenizer**: Started from the pre-trained `gpt2` tokenizer.
- **Training**: The `train_new_from_iterator` method of `transformers.PreTrainedTokenizerFast` was used to learn a new vocabulary and merge rules from the CodeSearchNet Python code corpus. The new vocabulary size was set to 52,000 tokens. A minimal training sketch follows this list.
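The sketch below reconstructs the procedure described above. The batch size and output path are illustrative choices, not values taken from the original training run:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("claudios/code_search_net", "python", split="train")
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def batch_iterator(batch_size=1000):
    # Yield batches of raw Python function strings for tokenizer training
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

# Learn a fresh 52,000-token vocabulary and merge rules from the code corpus
new_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=52000
)
new_tokenizer.save_pretrained("new_tokeniser")  # illustrative output path
```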
## How to Use
You can load and use this tokenizer with the `transformers` library:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
"""

tokens = tokenizer.tokenize(example_code)
print(tokens)

# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```
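Because the vocabulary was learned from Python source, this tokenizer typically produces fewer tokens on code than the base `gpt2` tokenizer. A quick way to check this yourself (exact counts will vary by snippet):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

snippet = "def mean(values):\n    return sum(values) / len(values)\n"

# Fewer tokens per snippet generally means cheaper, longer-context code modeling
print("gpt2  :", len(base.tokenize(snippet)))
print("custom:", len(custom.tokenize(snippet)))
```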
## License
This tokenizer is licensed under the MIT License.
## Author
[rajaykumar12959](https://huggingface.co/rajaykumar12959)