---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---

# Tokenizer for Python Code (Trained on CodeSearchNet)

## Model Description

This is a custom Byte-Pair Encoding (BPE) tokenizer built by starting from a pre-trained `gpt2` tokenizer and training a new vocabulary on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). The tokenizer is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
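
One way to see the effect of the code-specific vocabulary is to compare token counts against the base `gpt2` tokenizer on a small snippet. This is only an illustrative sketch; the exact counts depend on the learned merges.

```python
from transformers import AutoTokenizer

code = "def add(a, b):\n    return a + b\n"

base = AutoTokenizer.from_pretrained("gpt2")
custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# A vocabulary trained on Python code typically needs fewer tokens
# for the same snippet than the general-purpose gpt2 vocabulary.
print("gpt2:", len(base.tokenize(code)))
print("custom:", len(custom.tokenize(code)))
```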

## Training Data

The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to its Python examples. The training corpus consisted of 412,178 Python function strings.
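
A minimal sketch of how such a corpus can be fed to the trainer, assuming the dataset exposes a `python` configuration and the `whole_func_string` column described above (the configuration name is an assumption; adapt it to the dataset's actual layout):

```python
from datasets import load_dataset

# Python subset of CodeSearchNet; the "python" config name is an assumption
raw_dataset = load_dataset("claudios/code_search_net", "python", split="train")

def batch_iterator(batch_size=1000):
    """Yield batches of raw function strings for tokenizer training."""
    for start in range(0, len(raw_dataset), batch_size):
        yield raw_dataset[start : start + batch_size]["whole_func_string"]
```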

## Training Procedure

1.  **Base Tokenizer**: Started with a pre-trained `gpt2` tokenizer.
2.  **Training**: The `train_new_from_iterator` method from `transformers.PreTrainedTokenizerFast` was used to train a new vocabulary and merges from the `CodeSearchNet` Python code corpus. The new vocabulary size was set to 52,000 tokens.
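
A hedged reconstruction of these two steps (the exact training script is not part of this card; `batch_iterator()` refers to the sketch in the Training Data section, and the output path is illustrative):

```python
from transformers import AutoTokenizer

# Step 1: start from the pre-trained gpt2 tokenizer
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Step 2: learn a new vocabulary and merges from the Python corpus
new_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=52000
)

# Save locally (and optionally push to the Hub)
new_tokenizer.save_pretrained("new_tokeniser")
```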

## How to Use

You can load and use this tokenizer with the `transformers` library:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```

## License

This tokenizer is licensed under the MIT License.

## Author

[rajaykumar12959](https://huggingface.co/rajaykumar12959)