---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---
# Tokenizer for Python Code (Trained on CodeSearchNet)
## Model Description
This is a custom Byte-Pair Encoding (BPE) tokenizer, initialized from the `gpt2` tokenizer and retrained on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). It is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
## Training Data
The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to Python examples. The training corpus comprised 412,178 Python function strings.
## Training Procedure
1. **Base Tokenizer**: Started with a pre-trained `gpt2` tokenizer.
2. **Training**: The `train_new_from_iterator` method from `transformers.PreTrainedTokenizerFast` was used to train a new vocabulary and merges from the `CodeSearchNet` Python code corpus. The new vocabulary size was set to 52,000 tokens.
## How to Use
You can load and use this tokenizer with the `transformers` library:
```python
from transformers import AutoTokenizer
# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")
# Example usage
# (The snippet is only tokenized, not executed, so torch need not be imported.)
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
"""
tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']
encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```
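To see the effect of the Python-trained vocabulary, you can compare token counts against the base `gpt2` tokenizer on a code snippet (illustrative; both tokenizers are fetched from the Hub, and the exact counts depend on the snippet):

```python
from transformers import AutoTokenizer

# A vocabulary learned on Python source tends to need fewer tokens
# for idiomatic code than the general-purpose gpt2 vocabulary.
base = AutoTokenizer.from_pretrained("gpt2")
custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

snippet = "def __init__(self, input_size, output_size):"
base_n = len(base.tokenize(snippet))
custom_n = len(custom.tokenize(snippet))
print(f"gpt2: {base_n} tokens, custom: {custom_n} tokens")
```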
## License
This tokenizer is licensed under the MIT License.
## Author
[rajaykumar12959](https://huggingface.co/rajaykumar12959)