---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---
# Tokenizer for Python Code (Trained on CodeSearchNet)

## Model Description
This is a custom Byte-Pair Encoding (BPE) tokenizer, built by retraining a `gpt2` tokenizer on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). It keeps `gpt2`'s byte-level pre-tokenization but learns a new vocabulary and merge rules, so it tokenizes Python code more efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.
## Training Data

The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to its Python subset. The training corpus consisted of approximately 412,178 Python function strings.
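
A minimal sketch of how such a corpus can be loaded with the `datasets` library; the `"python"` configuration name is an assumption based on the dataset card, not something this card confirms:

```python
from datasets import load_dataset

# Load the Python subset of CodeSearchNet.
# The "python" configuration name is an assumption.
raw_dataset = load_dataset("claudios/code_search_net", "python", split="train")

# Training text comes from the `whole_func_string` column.
print(raw_dataset[0]["whole_func_string"][:200])
print(f"{len(raw_dataset)} function strings in the corpus")
```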
## Training Procedure

1. **Base Tokenizer**: Started from the pre-trained `gpt2` tokenizer.
2. **Training**: The `train_new_from_iterator` method of `transformers.PreTrainedTokenizerFast` was used to learn a new vocabulary and merge rules from the CodeSearchNet Python corpus, with the vocabulary size set to 52,000 tokens (see the sketch after this list).
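
A minimal sketch of this procedure, assuming the corpus was loaded as `raw_dataset` above; the batching helper is illustrative, not the author's exact code:

```python
from transformers import AutoTokenizer

# Start from the pre-trained gpt2 tokenizer (byte-level BPE).
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def get_training_corpus(dataset, batch_size=1000):
    # Yield batches of raw Python function strings for the trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

# Train a new vocabulary and merges; gpt2's pre-tokenization settings are reused.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(raw_dataset), vocab_size=52000
)
new_tokenizer.save_pretrained("new_tokeniser")
```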
## How to Use

You can load and use this tokenizer with the `transformers` library:
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```
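
To gauge the benefit of the retrained vocabulary, you can compare token counts against the base `gpt2` tokenizer on the same snippet; a rough sketch (exact counts will vary):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# A code-adapted vocabulary typically produces fewer tokens on Python source.
print("gpt2 tokens:  ", len(base.tokenize(example_code)))  # example_code from above
print("custom tokens:", len(custom.tokenize(example_code)))
```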
## License

This tokenizer is licensed under the MIT License.

## Author

[rajaykumar12959](https://huggingface.co/rajaykumar12959)