You are asked to refine a CUDA programming task description into a complete and precise instruction suitable for an LLM to generate code from. The input prompt is relatively simple, and it must be refined to ensure clarity, consistency, and better comprehension by the LLM.
Instructions: The refined instruction should be declarative and begin with an action verb such as "Write", "Implement", or "Create". Only output the refined prompt text, without any explanation or commentary. Do not include headers, labels, or markdown formatting.
Below are some examples of refined prompts in the standardized format:
1. Sigmoid, Refined prompt example: Implement a CUDA program for sigmoid activation function: $\text{sigmoid}(x) = 1 / (1 + \exp(-x))$. Input shape: (batch\_size, dim); Output: same shape as input.
2. Matrix Multiplication, Refined prompt example: Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU. Given matrix $A$ of dimensions $M \times K$ and matrix $B$ of dimensions $K \times N$, compute the product matrix $C = A \times B$, which will have dimensions $M \times N$.
3. Max Pooling 3D, Refined prompt example: Implement a CUDA program for a 3D max pooling function that selects the maximum value within a defined region (a window) of a feature map. Input shape: (batch\_size, channels, dim1, dim2, dim3); Output: the input with 3D max pooling applied.
4. LayerNorm, Refined prompt example: Implement a GPU program that performs Layer Normalization (LayerNorm) operation, which normalizes across the features for each individual data sample in a layer. Input of shape (batch\_size, features, dim1, dim2); Output with Layer Normalization applied, same shape as input.
5. 2D Convolution, Refined prompt example: Write a program that performs a 2D convolution operation on the GPU. Given an input matrix and a kernel (filter), compute the convolved output. The convolution should be performed with a "valid" boundary condition, meaning the kernel is only applied where it fully overlaps with the input. The input consists of: (1) input: A 2D matrix of 32-bit floating-point numbers, represented as a 1D array in row-major order. (2) kernel: A 2D kernel (filter) of 32-bit floating-point numbers, also represented as a 1D array in row-major order. The output should be written to the output matrix (also a 1D array in row-major order). The output matrix will have dimensions: output\_rows = input\_rows - kernel\_rows + 1, output\_cols = input\_cols - kernel\_cols + 1. The convolution operation is defined as: $output[i][j] = \sum_{m=0}^{kernel\_rows-1} \sum_{n=0}^{kernel\_cols-1} input[i+m][j+n] * kernel[m][n]$.
6. Multi-Head Self-Attention, Refined prompt example: Implement a CUDA program for multi-head self-attention. Given three input matrices $Q$ (queries), $K$ (keys), and $V$ (values) of size $N \times d_{\text{model}}$, compute: $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)$, where each head computes: $\text{head}_i = \text{softmax}\left(\frac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$ with $d_k = d_{\text{model}}/h$ and $Q_i$, $K_i$, $V_i$ being the $i$-th head's partition of the input matrices.
7. Mean Square Error, Refined prompt example: Implement a CUDA program to calculate the Mean Squared Error (MSE) between predicted values and target values. Given two arrays of equal length, predictions and targets, compute: $\text{MSE}=\frac{1}{N}\sum_{i=1}^{N}(predictions_{i}-targets_{i})^2$ where $N$ is the number of elements in each array. Input: predictions, targets; Output: MSE.
8. Matrix Transpose, Refined prompt example: Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The transpose of a matrix switches its rows and columns. Given a matrix $A$ of dimensions rows $\times$ cols, the transpose $A^T$ will have dimensions cols $\times$ rows. All matrices are stored in row-major format.
9. Reverse Array, Refined prompt example: Implement a program that reverses an array of 32-bit floating point numbers in-place. The program should perform an in-place reversal of the input array.
10. ReLU Activation Function, Refined prompt example: Implement a program that performs the Rectified Linear Unit (ReLU) activation function on a vector of 32-bit floating point numbers. The ReLU function sets all negative values to zero and leaves positive values unchanged: $\text{ReLU}(x)=\max(0,x)$. (An illustrative kernel sketch for this example appears after the list.)
11. Top-K Selection, Refined prompt example: Implement a GPU program that, given a 1D array input of 32-bit floating point numbers of length $N$, selects the $k$ largest elements and writes them in descending order to the output array of length $k$.
12. Sorting, Refined prompt example: Write a CUDA program that sorts an array of 32-bit floating-point numbers in ascending order using the bubble sort algorithm. Do not use other algorithms.
13. Matrix Copy, Refined prompt example: Implement a program that copies an $N \times N$ matrix of 32-bit floating point numbers from input array $A$ to output array $B$ on the GPU. The program should perform a direct element-wise copy so that $B_{i,j} = A_{i,j}$ for all valid indices.
14. Reduction, Refined prompt example: Write a CUDA program that performs parallel reduction on an array of 32-bit floating point numbers to compute their sum. The program should take an input array and produce a single output value containing the sum of all elements.
15. Dot Product, Refined prompt example: Implement a CUDA program that computes the dot product of two vectors containing 32-bit floating point numbers. The dot product is the sum of the products of the corresponding elements of two vectors. Mathematically, the dot product of two vectors $A$ and $B$ of length $n$ is defined as: $A \cdot B = \sum_{i=0}^{n-1} A_i \cdot B_i$.
16. Prefix Sum, Refined prompt example: Write a CUDA program that computes the prefix sum (cumulative sum) of an array of 32-bit floating point numbers. For an input array $[a, b, c, d, \ldots]$, the prefix sum is $[a, a+b, a+b+c, a+b+c+d, \ldots]$.
17. Categorical Cross-Entropy Loss, Refined prompt example: Implement a CUDA program to calculate the categorical cross-entropy loss for a batch of predictions. Given a matrix of predicted logits $Z$ of size $N \times C$ and a vector of true class labels true\_labels of size $N$, compute the average cross-entropy loss over the batch. The loss for a single sample $j$ with logits $z_j = [z_{j1}, \ldots, z_{jC}]$ and true label $y_j$ is calculated using the numerically stable formula: $\text{Loss}_j = \log\left(\sum_{k=1}^{C} e^{z_{jk}}\right) - z_{j, y_j}$. The final output stored in the loss variable should be the average loss over the $N$ samples: $L = \frac{1}{N} \sum_{j=1}^{N} \text{Loss}_j$. Input: logits, true\_labels, $N$ (number of samples), and $C$ (number of classes). Output: loss (a pointer to a single float).
18. Monte Carlo Integration, Refined prompt example: Implement Monte Carlo integration on a GPU. Given a set of function values $y_i=f(x_i)$ sampled at random points uniformly distributed in the interval $[a, b]$, estimate the definite integral: $\int_{a}^{b}f(x)dx\approx (b-a)\cdot\frac{1}{N}\sum_{i=1}^{N}y_i$. The Monte Carlo method approximates the integral by computing the average of the function values and multiplying by the interval width.
19. Histogramming, Refined prompt example: Write a GPU program that computes the histogram of an array of 32-bit integers. The histogram should count the number of occurrences of each integer value in the range [0, num\_bins). You are given an input array input of length $N$ and the number of bins num\_bins. The result should be an array of integers of length num\_bins, where each element represents the count of occurrences of its corresponding index in the input array.
20. Ordinary Least Squares Regression, Refined prompt example: Solve the Ordinary Least Squares (OLS) regression problem on a GPU. Given a feature matrix $X$ of size $n\_samples \times n\_features$ and a target vector $y$ of size $n\_samples$, compute the coefficient vector $\beta$ that minimizes the sum of squared residuals: $\min_{\beta} \|X\beta - y\|^2$. The closed-form solution to OLS is: $\beta = (X^TX)^{-1}X^Ty$.
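For context on what these refined prompts are meant to elicit, here is a minimal sketch of a CUDA kernel a model might produce for example 10 (ReLU). It is illustrative only; the function and variable names (relu_kernel, relu, n) are hypothetical and not part of the required prompt format.

#include <cuda_runtime.h>

// One thread handles one element: ReLU(x) = max(0, x).
__global__ void relu_kernel(const float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = fmaxf(input[i], 0.0f);  // negative values become zero
    }
}

// Host-side launch: 256 threads per block, grid rounded up to cover all n elements.
void relu(const float* d_input, float* d_output, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(d_input, d_output, n);
}

The choice of 256 threads per block is a common default; any block size works equivalently, since the bounds check in the kernel guards the final partial block.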
Please rewrite the following prompt accordingly:
[input.txt]