---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---
# Tokenizer for Python Code (Trained on CodeSearchNet)
## Model Description
This is a custom Byte-Pair Encoding (BPE) tokenizer, initialized from the `gpt2` tokenizer and retrained on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). It is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
## Training Data
The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, focusing specifically on Python code. The training corpus consisted of 412,178 Python function strings.
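The corpus can be inspected with the `datasets` library. The snippet below is a rough sketch; the `"python"` configuration name is an assumption based on the upstream `code_search_net` layout:

```python
from datasets import load_dataset

# Python subset of CodeSearchNet; the "python" config name is an assumption
dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Each row's `whole_func_string` field holds the full source of one Python function
print(dataset.num_rows)
print(dataset[0]["whole_func_string"][:200])
```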
## Training Procedure
1. **Base Tokenizer**: Started with a pre-trained `gpt2` tokenizer.
2. **Training**: The `train_new_from_iterator` method of `transformers.PreTrainedTokenizerFast` was used to learn a new vocabulary and merge rules from the CodeSearchNet Python corpus, with the vocabulary size set to 52,000 tokens. A sketch of this step is shown below.
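The procedure can be reproduced roughly as follows. This is a sketch rather than the exact training script; the batch size and output directory are illustrative:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Yield the corpus in batches so the trainer streams text instead of holding it all in memory
def training_corpus(batch_size=1000):
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]["whole_func_string"]

# Start from the pre-trained gpt2 tokenizer and learn a new BPE vocabulary and merges
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = base_tokenizer.train_new_from_iterator(training_corpus(), vocab_size=52000)

# Output directory name is illustrative
new_tokenizer.save_pretrained("python-code-tokenizer")
```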
## How to Use
You can load and use this tokenizer with the `transformers` library:
```python
from transformers import AutoTokenizer
# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")
# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """
tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']
encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```
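To see the effect of the retraining, you can compare token counts against the base `gpt2` tokenizer. The retrained vocabulary typically splits Python code into fewer tokens, though exact counts depend on the snippet:

```python
from transformers import AutoTokenizer

code_tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

snippet = "def add(a, b):\n    return a + b\n"

# Compare how many tokens each tokenizer needs for the same code
print("retrained:", len(code_tokenizer.tokenize(snippet)))
print("gpt2:     ", len(gpt2_tokenizer.tokenize(snippet)))
```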
## License
This tokenizer is licensed under the MIT License.
## Author
[rajaykumar12959](https://huggingface.co/rajaykumar12959)