VeriDebug / README.md

WANG Ning

	---
	license: apache-2.0
	language:
	- en
	base_model:
	- WANGNingroci/VeriSeek
	---

	Usage:

	```python
	import re
	import json
	import numpy as np
	from src.gritlm import GritLM
	from scipy.spatial.distance import cosine

	KEY_WORDS = ['endmodule', 'end', 'endcase', 'else', 'begin']
	REP_QUERY = 'Represent this text: '
	LINE_QUERY = """Now you are a verilog designer. You are given the design description and buggy verilog code segment. Infer the bug type in the code segment."""
	TYPE_QUERY = "Now you are a verilog designer. You are given the design description and buggy verilog code segment. Infer the bug type in the code segment."
	CLS_QUERY = 'Now you are a verilog designer. You are given the design description and buggy verilog code segment. Infer the bug type in the code segment.\n'
	CLS_DESC = 'The bug type is '
	BUG_CLS = {
	    'width': 0, 'logic': 0, 'assignment': 0, 'initial': 0, 'data': 0,
	    'state': 0, 'others': 0, 'comparison': 0, 'bitwise': 0, 'condition': 0,
	    'signal': 0, 'arithmetic': 0, 'value': 0
	}
	BUG_DESC = {
	    'width': 'Mismatched bit widths in assignments, operations, or port connections, leading to unintended truncation or zero-extension.',
	    'logic': 'Errors in combinational or sequential logic design, resulting in incorrect circuit behavior or timing issues.',
	    'assignment': 'Improper use of blocking (=) or non-blocking (<=) assignments, causing race conditions or unexpected signal updates.',
	    'initial': 'Incorrect initialization of variables or registers, leading to undefined behavior or simulation mismatches.',
	    'data': 'Errors in data handling, such as incorrect data types, improper conversions, or misuse of signed/unsigned values.',
	    'state': 'Flaws in finite state machine (FSM) design, including missing states, incorrect transitions, or improper state encoding.',
	    'others': 'Miscellaneous errors that don\'t fit into other categories, such as syntax errors or tool-specific issues.',
	    'comparison': 'Incorrect use of equality (==) or inequality (!=) operators, or misuse of case equality (===) and case inequality (!==).',
	    'bitwise': 'Errors in bitwise operations, including incorrect use of AND, OR, XOR, or shift operators.',
	    'condition': 'Flaws in conditional statements (if-else, case) leading to incorrect branching or priority encoding issues.',
	    'signal': 'Errors related to signal declarations, including incorrect use of wire/reg, input/output ports, or signal naming conflicts.',
	    'arithmetic': 'Mistakes in arithmetic operations, such as overflow/underflow issues or incorrect use of signed/unsigned arithmetic.',
	    'value': 'Incorrect constant values, parameter definitions, or literal representations leading to unexpected circuit behavior.'
	}
	GEN_INST = 'Now you are a verilog designer. You need to fix the bug in the buggy code segment:\n'
	SPEC = """
	---
	### Module Specification: Cfu
	# #### 1. Overview
	# The `Cfu` (Custom Function Unit) module is designed to perform a simple selection operation based on the input command. It processes two 32-bit inputs and outputs one of them based on the least significant bit of the function ID. The module operates synchronously with a clock signal and uses a simple handshake protocol for command acceptance and response delivery.
	# #### 2. Interface Description
	# ##### Inputs:
	# - cmd_valid (`input`): A signal indicating if the command inputs are valid.
	# - cmd_payload_function_id (`input [9:0]`): A 10-bit function identifier which determines the operation of the module. Currently, only the LSB is used for selecting the output.
	# - cmd_payload_inputs_0 (`input [31:0]`): A 32-bit input data.
	# - cmd_payload_inputs_1 (`input [31:0]`): Another 32-bit input data.
	# - rsp_ready (`input`): A signal from the downstream component indicating it is ready to accept the response.
	# - reset (`input`): Asynchronous reset signal.
	# - clk (`input`): Clock signal.
	# ##### Outputs:
	# - cmd_ready (`output`): A signal indicating the module is ready to accept a command.
	- rsp_valid (`output`): A signal indicating that the response is valid and ready to be read.
	# - rsp_payload_outputs_0 (`output [31:0]`): The 32-bit output data, which is one of the two input data values based on the function ID.
	#### 3. Functional Description
	##### Command and Response Protocol:
	- Handshake Mechanism: The module uses a simple handshake mechanism for command acceptance and response delivery. The `cmd_ready` signal is asserted when the module is ready to accept a new command, which depends on the `rsp_ready` signal. The `rsp_valid` signal is asserted when the module has a valid response ready, which is directly tied to the `cmd_valid` signal.

	##### Data Processing:
	- Output Selection: The output, `rsp_payload_outputs_0`, is selected based on the least significant bit (LSB) of `cmd_payload_function_id`. If the LSB is 0, `rsp_payload_outputs_0` is equal to `cmd_payload_inputs_0`. If the LSB is 1, `rsp_payload_outputs_0` is equal to `cmd_payload_inputs_1`.
	#### 4. Timing and Synchronization
	- The module operates synchronously with respect to the provided clock signal (`clk`). All inputs are sampled, and outputs are updated on the rising edge of the clock.
	- The reset (`reset`) is asynchronous and active-high, which means all internal states and outputs are reset when `reset` is asserted, regardless of the clock.
	#### 5. Use Cases
	- Simple Data Selector: This module can be used in systems where conditional data forwarding is needed based on a simple configuration or status bit.
	# - Function ID Expansion: While currently only the LSB of the function ID is used, the module can be expanded to use more bits for more complex selection logic or operations.
	#### 6. Limitations and Future Enhancements
	# - Function ID Utilization: Currently, only the LSB of the function ID is used. Future enhancements could include decoding more bits to perform different operations.
	- Pipeline Stages: The module is purely combinational regarding the data path. Adding pipeline stages could help in meeting timing requirements for higher clock frequencies.
	---
	This specification provides a detailed overview of the `Cfu` module's functionality, interface, and behavior based on the provided Verilog code. It outlines the basic operation, use cases, and potential areas for future enhancements.

	Buggy code:

	"""
	BUGGY = """
	module Cfu (
	input cmd_valid,
	output cmd_ready,
	input [9:0] cmd_payload_function_id,
	input [31:0] cmd_payload_inputs_0,
	input [31:0] cmd_payload_inputs_1,
	output rsp_valid,
	input rsp_ready,
	output [31:0] rsp_payload_outputs_0,
	input reset,
	input clk
	);
	// Trivial handshaking for a combinational CFU
	assign rsp_valid = cmd_valid;
	assign cmd_ready = rsp_ready | cmd_valid;
	//
	// select output -- note that we're not fully decoding the 3 function_id bits
	//
	assign rsp_payload_outputs_0 = cmd_payload_function_id[0] ?
	cmd_payload_inputs_1 :
	cmd_payload_inputs_0 ;
	endmodule
	"""
	JSON_FORMAT = '{"buggy_code": "The buggy code in the systemverilog (just one line of code)", "correct_code": "The correct code (just one line of code that can directly replace the buggy code, without any other description)"}'
	BUGGY_LINE_GT = "assign cmd_ready = rsp_ready | cmd_valid;"
	BUGGY_CLS_GT = "logic"
	FIX_GT = "assign cmd_ready = rsp_ready;"


	def gen_neg(buggy_code):
	    """Return (all non-empty stripped lines, candidate lines for bug localization)."""
	    # strip() already removes surrounding spaces, tabs and carriage returns
	    buggy_code_lines = [line.strip() for line in buggy_code.split('\n')]
	    buggy_code_lines = [line for line in buggy_code_lines if len(line) > 0]
	    # remove comment lines (//, /* ... */ and their '*' continuation lines) and structural keywords
	    buggy_code_lines_neg = [
	        line for line in buggy_code_lines
	        if not line.startswith('//') and not line.startswith('*')
	        and not line.startswith('/*') and line not in KEY_WORDS]
	    # remove lines too short to be useful (no space, or at most 4 non-space characters)
	    buggy_code_lines_neg = [
	        line for line in buggy_code_lines_neg
	        if ' ' in line and len(line.replace(' ', '')) > 4]

	    return buggy_code_lines, buggy_code_lines_neg


	def gritlm_instruction(instruction):
	    # wrap the instruction in the GritLM embedding prompt format
	    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"


	def extract_bug_types(text):
	    # Match the class name that follows the CLS_DESC prefix
	    pattern = r'The bug type is (\w+)'

	    # Return every match found in the text
	    matches = re.findall(pattern, text)

	    return matches

	# Load the model for both embedding and generation; if you only need embeddings, pass `mode="embedding"` to save memory (no LM head)
	model_path = "./VeriDebug"
	model = GritLM(model_path, torch_dtype="auto", mode="unified")
	print(f"Model loaded from {model_path}")

	### Embedding/Representation ###
	# buggy line location
	_, buggy_code_lines = gen_neg(BUGGY)
	query = [LINE_QUERY + "\n" + SPEC + "\n" + BUGGY]
	q_rep = model.encode(query,
	                     instruction=gritlm_instruction("Represent this text:"),
	                     max_length=4096)
	d_rep = model.encode(buggy_code_lines,
	                     instruction=gritlm_instruction(""),
	                     max_length=128)
	cosine_sim = [1 - cosine(q_rep[0], d) for d in d_rep]
	sim_rank = np.argsort(cosine_sim)[::-1]
	buggy_code_lines_ranked = [buggy_code_lines[i] for i in sim_rank]
	print("========== Buggy code lines ranked by similarity ==========")
	print(f"Buggy code lines candidates (ranked by similarity): \n{buggy_code_lines_ranked}")
	print("----------------------------------------")
	print(f"Ground truth: {BUGGY_LINE_GT}")
	print("===========================================================")

	# bug type classification
	query = [TYPE_QUERY + "\n" + SPEC + "\n" + BUGGY]
	d_types = [CLS_DESC + b + '.' + BUG_DESC[b]
	           for b in BUG_CLS.keys()]
	q_rep = model.encode(query,
	                     instruction=gritlm_instruction(REP_QUERY),
	                     max_length=4096)
	d_rep = model.encode(d_types,
	                     instruction=gritlm_instruction(""),
	                     max_length=128)
	cosine_sim = [1 - cosine(q_rep[0], d) for d in d_rep]
	sim_rank = np.argsort(cosine_sim)[::-1]
	buggy_type_ranked = [d_types[i] for i in sim_rank]
	print("============ Buggy type ranked by similarity ==============")
	print(f"Buggy type candidates (ranked by similarity): \n{[extract_bug_types(i)[0] for i in buggy_type_ranked]}")
	print("----------------------------------------")
	print(f"Ground truth: {BUGGY_CLS_GT}")
	print("===========================================================")


	### Generation ###
	instruct = f'{GEN_INST}{BUGGY}\n\nThe specification file of this code is:\n{SPEC}\n\nThe possible buggy lines ranking list are:\n{buggy_code_lines_ranked}\n\nThe possible bug type ranking list are:\n{buggy_type_ranked}\n\nYour task is to return me a json to analyze how the code should be modified, in the following format:\n{JSON_FORMAT}.'
	messages = [
	    {"role": "user", "content": instruct},
	]
	encoded = model.tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt")
	encoded = encoded.to(model.device)
	gen = model.generate(encoded, max_new_tokens=256, do_sample=True)
	valid_gen = gen[:, encoded.shape[1]:]
	decoded = model.tokenizer.batch_decode(valid_gen)
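	# keep only the text up to and including the first '}', which closes the flat JSON object
	decoded = [d[:d.find('}')+1] for d in decoded]
	# sampling is enabled, so the output may occasionally fail to parse as JSON
	decoded_dict = json.loads(decoded[0])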
	print("==================== Buggy fix ============================")
	print(f"Fix result: {decoded_dict}")
	print("----------------------------------------")
	print(f"Ground truth: {FIX_GT}")
	print("===========================================================")

	```
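
The ranking step used twice in the script above (once for candidate lines, once for bug types) can be tried without loading the model. The following is a minimal sketch, not part of VeriDebug: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, the two-dimensional vectors are made-up stand-ins for the embeddings returned by `model.encode`, and it uses only the standard library rather than `numpy`/`scipy`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates, candidate_vecs):
    """Return the candidates ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), cand)
        for cand, vec in zip(candidates, candidate_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]


# Toy data: made-up 2-D "embeddings" for two lines of the Cfu example.
query_vec = [1.0, 0.0]
lines = [
    "assign rsp_valid = cmd_valid;",
    "assign cmd_ready = rsp_ready | cmd_valid;",
]
line_vecs = [[0.0, 1.0], [1.0, 0.1]]

ranked = rank_by_similarity(query_vec, lines, line_vecs)
print(ranked)
```

With real embeddings, `query_vec` would correspond to `q_rep[0]` and `line_vecs` to `d_rep` in the script above.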