Write a CUDA Program for Matrix Multiplication
Matrix multiplication is a fundamental operation in various computational fields. Efficiently performing this operation, especially with large matrices, is critical for many applications. In this article, you will learn how to implement matrix multiplication using CUDA to leverage the parallel processing power of NVIDIA GPUs.
Problem Statement
Matrix multiplication involves taking two matrices, say A (of dimensions M x K) and B (of dimensions K x N), and producing a third matrix C (of dimensions M x N). Each element C[i][j] is calculated as the sum of the products of corresponding elements from row i of matrix A and column j of matrix B. A naive CPU implementation of matrix multiplication has a time complexity of O(MKN), which becomes computationally expensive for large matrices, leading to significant performance bottlenecks in applications like deep learning, scientific simulations, and image processing.
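For reference, a minimal CPU version of this definition might look like the sketch below (plain C, storing each matrix as a row-major 1D array, the same layout the CUDA code later in this article uses; the function name matmul_cpu is purely illustrative):

// Naive CPU matrix multiplication: C = A * B
// A is M x K, B is K x N, C is M x N, all stored row-major in 1D arrays.
void matmul_cpu(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int i = 0; i < M; ++i) {           // row of C
        for (int j = 0; j < N; ++j) {       // column of C
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {   // dot product of row i of A and column j of B
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

The three nested loops make the O(MKN) cost explicit; the CUDA approach described below effectively parallelizes the two outer loops by assigning one thread per output element.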
Example
Consider multiplying two 2x2 matrices:
Matrix A:
1 2
3 4
Matrix B:
5 6
7 8
The resulting matrix C would be:
C[0][0] = (A[0][0] * B[0][0]) + (A[0][1] * B[1][0]) = (1 * 5) + (2 * 7) = 5 + 14 = 19
C[0][1] = (A[0][0] * B[0][1]) + (A[0][1] * B[1][1]) = (1 * 6) + (2 * 8) = 6 + 16 = 22
C[1][0] = (A[1][0] * B[0][0]) + (A[1][1] * B[1][0]) = (3 * 5) + (4 * 7) = 15 + 28 = 43
C[1][1] = (A[1][0] * B[0][1]) + (A[1][1] * B[1][1]) = (3 * 6) + (4 * 8) = 18 + 32 = 50
Resulting Matrix C:
19 22
43 50
Background & Knowledge Prerequisites
To effectively understand and implement CUDA matrix multiplication, familiarity with the following concepts is beneficial:
- C/C++ Programming: Basic syntax, memory management, and data structures.
- GPU Architecture Fundamentals: Understanding of threads, blocks, grids, warps, and shared memory.
- CUDA Programming Model: Concepts like host and device memory, kernel functions (__global__), device functions (__device__), and host functions (__host__).
- CUDA Toolkit: Installation and setup of the NVIDIA CUDA development environment, including the nvcc compiler.
For this program, the necessary setup includes having a CUDA-capable NVIDIA GPU and the CUDA Toolkit installed on your system. No special imports beyond standard C libraries and cuda_runtime.h are required.
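To illustrate the function qualifiers mentioned above, here is a minimal sketch (the names scale, scaleKernel, and launchScale are made up for illustration and are not part of the matrix multiplication program):

#include <cuda_runtime.h>

// __device__: callable only from code running on the GPU
__device__ float scale(float x) { return 2.0f * x; }

// __global__: a kernel, launched from the host and executed on the device
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

// __host__ (the default for ordinary functions): runs on the CPU
__host__ void launchScale(float* d_data, int n) {
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
}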
Use Cases or Case Studies
Matrix multiplication is a cornerstone operation across various computational fields:
- Deep Learning: It forms the core of neural network computations, especially in convolutional layers and fully connected layers for training and inference.
- Image Processing: Operations like filtering, transformations, and convolutions often rely on matrix multiplications for efficient pixel manipulation.
- Scientific Simulations: Solving systems of linear equations, fluid dynamics, quantum mechanics, and finite element analysis frequently involve large-scale matrix operations.
- Computer Graphics: 3D transformations (scaling, rotation, translation), projection, and rendering pipelines extensively use matrix multiplications.
- Signal Processing: Digital filters, spectral analysis, and beamforming algorithms often leverage matrix multiplication for data transformation and analysis.
Solution Approaches
For matrix multiplication on a GPU, the most straightforward approach involves assigning each output element C[i][j] to a unique thread. This maximizes parallelism by allowing all elements of the result matrix to be computed concurrently.
1. Element-wise Thread Assignment (Naive CUDA)
This approach assigns one CUDA thread to calculate a single element of the output matrix C.
- One-line summary: Each thread computes one element of the result matrix C by iterating over the corresponding row of A and column of B.
- Code example:
// CUDA Matrix Multiplication
#include <stdio.h>
#include <stdlib.h>        // For malloc and free
#include <cuda_runtime.h>  // For CUDA API functions

// Define matrix dimensions
#define M 256
#define K 256
#define N 256

// CUDA error check macro
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file, const int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA Error at %s:%d: %s %s\n", file, line, func, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// CUDA kernel to perform matrix multiplication
__global__ void matrixMulKernel(float* A, float* B, float* C, int widthA, int heightA, int widthB, int heightB) {
    // Calculate the row and column of the C element to be computed by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Check if the thread is within the bounds of the output matrix C
    if (row < heightA && col < widthB) {
        float sum = 0.0f;
        for (int i = 0; i < widthA; ++i) { // widthA == heightB (the K dimension)
            sum += A[row * widthA + i] * B[i * widthB + col];
        }
        C[row * widthB + col] = sum;
    }
}

// Function to initialize a matrix with random values
void initializeMatrix(float* matrix, int rows, int cols) {
    for (int i = 0; i < rows * cols; ++i) {
        matrix[i] = rand() / (float)RAND_MAX; // Random float between 0 and 1
    }
}

// Function to print a matrix (for small matrices)
void printMatrix(float* matrix, int rows, int cols) {
    if (rows > 10 || cols > 10) { // Limit printing for large matrices
        printf("Matrix too large to print.\n");
        return;
    }
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            printf("%.2f ", matrix[i * cols + j]);
        }
        printf("\n");
    }
}

int main() {
    // Step 1: Allocate memory on host (CPU)
    float *h_A, *h_B, *h_C;
    size_t sizeA = M * K * sizeof(float);
    size_t sizeB = K * N * sizeof(float);
    size_t sizeC = M * N * sizeof(float);
    h_A = (float*)malloc(sizeA);
    h_B = (float*)malloc(sizeB);
    h_C = (float*)malloc(sizeC);
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Failed to allocate host memory!\n");
        exit(EXIT_FAILURE);
    }

    // Step 2: Initialize host matrices
    printf("Initializing host matrices...\n");
    initializeMatrix(h_A, M, K);
    initializeMatrix(h_B, K, N);

    // Optional: Print small matrices
    printf("Matrix A (%d x %d):\n", M, K);
    printMatrix(h_A, M, K);
    printf("Matrix B (%d x %d):\n", K, N);
    printMatrix(h_B, K, N);

    // Step 3: Allocate memory on device (GPU)
    float *d_A, *d_B, *d_C;
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_A, sizeA));
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_B, sizeB));
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_C, sizeC));

    // Step 4: Copy input matrices from host to device
    printf("Copying data from host to device...\n");
    CHECK_CUDA_ERROR(cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA_ERROR(cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice));

    // Step 5: Define grid and block dimensions for the kernel launch
    // A common choice is 16x16 or 32x32 threads per block
    int THREADS_PER_BLOCK_X = 16;
    int THREADS_PER_BLOCK_Y = 16;
    dim3 dimBlock(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y);
    dim3 dimGrid(
        (N + dimBlock.x - 1) / dimBlock.x, // Grid width (columns of C)
        (M + dimBlock.y - 1) / dimBlock.y  // Grid height (rows of C)
    );

    // Step 6: Launch the CUDA kernel
    printf("Launching CUDA kernel...\n");
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, K, M, N, K); // K is widthA and heightB
    CHECK_CUDA_ERROR(cudaGetLastError());      // Check for errors during kernel launch
    CHECK_CUDA_ERROR(cudaDeviceSynchronize()); // Wait for the kernel to finish and surface execution errors

    // Step 7: Copy the result matrix from device to host
    printf("Copying result from device to host...\n");
    CHECK_CUDA_ERROR(cudaMemcpy(h_C, d_C, sizeC, cudaMemcpyDeviceToHost));

    // Step 8: Print the result (optional for small matrices)
    printf("Result Matrix C (%d x %d):\n", M, N);
    printMatrix(h_C, M, N);

    // Step 9: Free device memory
    CHECK_CUDA_ERROR(cudaFree(d_A));
    CHECK_CUDA_ERROR(cudaFree(d_B));
    CHECK_CUDA_ERROR(cudaFree(d_C));

    // Step 10: Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Matrix multiplication completed and memory freed.\n");
    return 0;
}
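Assuming the program is saved as matmul.cu (the file name is arbitrary), it can be compiled and run with the nvcc compiler from the CUDA Toolkit:

nvcc matmul.cu -o matmul
./matmul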
- Sample output:
Initializing host matrices...
Matrix A (256 x 256):
Matrix too large to print.
Matrix B (256 x 256):
Matrix too large to print.
Copying data from host to device...
Launching CUDA kernel...
Copying result from device to host...
Result Matrix C (256 x 256):
Matrix too large to print.
Matrix multiplication completed and memory freed.
If you change M, K, and N to smaller values such as 2, printMatrix will show the actual values, for example:
Initializing host matrices...
Matrix A (2 x 2):
0.84 0.96
0.37 0.99
Matrix B (2 x 2):
0.04 0.98
0.73 0.28
Copying data from host to device...
Launching CUDA kernel...
Copying result from device to host...
Result Matrix C (2 x 2):
0.73 1.09
0.74 0.65
Matrix multiplication completed and memory freed.
(Values shown are illustrative; they depend on the rand() sequence. Since the program does not call srand(), the output is the same on every run.)
- Stepwise explanation:
  - Define Dimensions: The M, K, N constants define the dimensions of matrices A (M x K), B (K x N), and C (M x N).
  - Error Handling: A CHECK_CUDA_ERROR macro is defined for robust error checking of CUDA API calls.
  - Kernel Definition:
    - The matrixMulKernel function is declared with __global__, marking it as a CUDA kernel that runs on the GPU.
    - Each thread computes its position in the C matrix using blockIdx.y, blockDim.y, threadIdx.y for the row and blockIdx.x, blockDim.x, threadIdx.x for the column.
    - A bounds check ensures that row and col are within the valid matrix bounds (heightA for rows of C, widthB for columns of C).
    - The thread then computes its C[row][col] element by summing products from the corresponding row of A and column of B.
  - Host Code (main function):
    - Memory Allocation (Host): h_A, h_B, h_C are allocated on the CPU using malloc.
    - Initialization: h_A and h_B are filled with random floating-point values using initializeMatrix.
    - Memory Allocation (Device): d_A, d_B, d_C are allocated on the GPU using cudaMalloc.
    - Host-to-Device Copy: h_A and h_B are copied from host to device memory using cudaMemcpy(..., cudaMemcpyHostToDevice).
    - Launch Configuration: dim3 dimBlock defines the number of threads per block (e.g., 16x16), and dim3 dimGrid calculates the number of blocks needed to cover the entire output matrix C. The (N + dimBlock.x - 1) / dimBlock.x pattern ensures that enough blocks are launched even if the matrix dimensions are not evenly divisible by the block dimensions.
    - Kernel Launch: The matrixMulKernel<<<dimGrid, dimBlock>>>(...) syntax launches the kernel on the GPU; cudaGetLastError and cudaDeviceSynchronize catch launch and execution errors.
    - Device-to-Host Copy: The result d_C is copied back from device to host memory using cudaMemcpy(..., cudaMemcpyDeviceToHost).
    - Cleanup: cudaFree releases device memory, and free releases host memory.
Conclusion
Implementing matrix multiplication with CUDA significantly accelerates the computation compared to CPU-only solutions, especially for large matrices. By assigning each output element calculation to a separate thread, the GPU's massive parallel processing capabilities can be fully utilized. This fundamental parallelization pattern is a cornerstone for many high-performance computing tasks on GPUs.
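If you want to measure the kernel's runtime on your own hardware, CUDA events can be used for timing. The following is a minimal sketch, assuming it is placed inside main after Step 5 of the program above and reuses d_A, d_B, d_C, dimGrid, and dimBlock:

    // Time the kernel with CUDA events (sketch; reuses variables from main above)
    cudaEvent_t start, stop;
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    CHECK_CUDA_ERROR(cudaEventRecord(start));
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, K, M, N, K);
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %.3f ms\n", ms);

    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));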
Summary
- Matrix multiplication is a computationally intensive operation with O(MKN) complexity (O(N^3) for square matrices) in naive CPU implementations.
- GPUs, with their parallel architecture, are well-suited for accelerating matrix multiplication.
- CUDA allows direct programming of NVIDIA GPUs using a C/C++ based language.
- A common CUDA approach assigns one thread per output matrix element.
- The process involves:
- Allocating memory on both host and device.
- Copying input data from host to device.
- Defining grid and block dimensions for kernel launch.
- Launching a __global__ kernel to perform the calculations.
- Copying the results back from device to host.
- Freeing allocated memory.
- This parallelization strategy is crucial for performance in deep learning, scientific computing, and graphics.