---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---
# Tokenizer for Python Code (Trained on CodeSearchNet)

## Model Description
This is a custom Byte-Pair Encoding (BPE) tokenizer, built by retraining a `gpt2` tokenizer on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). It keeps `gpt2`'s byte-level pre-tokenization but learns a new vocabulary and merge rules, so it tokenizes Python code more efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
## Training Data

The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to its Python subset. The training corpus consisted of approximately 412,178 Python function strings.
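
A minimal sketch of how such a corpus can be loaded with the `datasets` library; the `"python"` configuration name is an assumption based on the dataset card, not something this card confirms:

```python
from datasets import load_dataset

# Load the Python subset of CodeSearchNet.
# The "python" configuration name is an assumption.
raw_dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Training text comes from the `whole_func_string` column.
print(raw_dataset[0]["whole_func_string"][:200])
print(f"{len(raw_dataset)} function strings in the corpus")
```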
## Training Procedure

1. **Base Tokenizer**: Started from the pre-trained `gpt2` tokenizer.
2. **Training**: The `train_new_from_iterator` method of `transformers.PreTrainedTokenizerFast` was used to learn a new vocabulary and merge rules from the CodeSearchNet Python corpus, with the vocabulary size set to 52,000 tokens (see the sketch after this list).
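
A minimal sketch of this procedure, assuming the corpus was loaded as `raw_dataset` above; the batching helper is illustrative, not the author's exact code:

```python
from transformers import AutoTokenizer

# Start from the pre-trained gpt2 tokenizer (byte-level BPE).
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def get_training_corpus(dataset, batch_size=1000):
    # Yield batches of raw Python function strings for the trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

# Train a new vocabulary and merges; gpt2's pre-tokenization settings are reused.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(raw_dataset), vocab_size=52000
)
new_tokenizer.save_pretrained("new_tokeniser")
```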
## How to Use

You can load and use this tokenizer with the `transformers` library:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```
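
To gauge the benefit of the retrained vocabulary, you can compare token counts against the base `gpt2` tokenizer on the same snippet; a rough sketch (exact counts will vary):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# A code-adapted vocabulary typically produces fewer tokens on Python source.
print("gpt2 tokens:  ", len(base.tokenize(example_code)))  # example_code from above
print("custom tokens:", len(custom.tokenize(example_code)))
```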
## License

This tokenizer is licensed under the MIT License.

## Author

[rajaykumar12959](https://huggingface.co/rajaykumar12959)