Write a CUDA Program for Matrix Multiplication
Matrix multiplication is a fundamental operation in various computational fields. Efficiently performing this operation, especially with large matrices, is critical for many applications. In this article, you will learn how to implement matrix multiplication using CUDA to leverage the parallel processing power of NVIDIA GPUs.
Problem Statement
Matrix multiplication involves taking two matrices, say A (of dimensions M x K) and B (of dimensions K x N), and producing a third matrix C (of dimensions M x N). Each element C[i][j] is calculated as the sum of the products of corresponding elements from row i of matrix A and column j of matrix B. A naive CPU implementation of matrix multiplication has a time complexity of O(MKN), which becomes computationally expensive for large matrices, leading to significant performance bottlenecks in applications like deep learning, scientific simulations, and image processing.
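For reference, a minimal CPU version of this definition might look like the sketch below (plain C, storing each matrix as a row-major 1D array, the same layout the CUDA code later in this article uses; the function name matmul_cpu is purely illustrative):

// Naive CPU matrix multiplication: C = A * B
// A is M x K, B is K x N, C is M x N, all stored row-major in 1D arrays.
void matmul_cpu(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int i = 0; i < M; ++i) {           // row of C
        for (int j = 0; j < N; ++j) {       // column of C
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {   // dot product of row i of A and column j of B
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

The three nested loops make the O(MKN) cost explicit; the CUDA approach described below effectively parallelizes the two outer loops by assigning one thread per output element.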
Example
Consider multiplying two 2x2 matrices:
Matrix A:
1 2
3 4
Matrix B:
5 6
7 8
The resulting matrix C would be:
C[0][0] = (A[0][0] * B[0][0]) + (A[0][1] * B[1][0]) = (1 * 5) + (2 * 7) = 5 + 14 = 19
C[0][1] = (A[0][0] * B[0][1]) + (A[0][1] * B[1][1]) = (1 * 6) + (2 * 8) = 6 + 16 = 22
C[1][0] = (A[1][0] * B[0][0]) + (A[1][1] * B[1][0]) = (3 * 5) + (4 * 7) = 15 + 28 = 43
C[1][1] = (A[1][0] * B[0][1]) + (A[1][1] * B[1][1]) = (3 * 6) + (4 * 8) = 18 + 32 = 50
Resulting Matrix C:
19 22
43 50
Background & Knowledge Prerequisites
To effectively understand and implement CUDA matrix multiplication, familiarity with the following concepts is beneficial:
- C/C++ Programming: Basic syntax, memory management, and data structures.
- GPU Architecture Fundamentals: Understanding of threads, blocks, grids, warps, and shared memory.
- CUDA Programming Model: Concepts like host and device memory, kernel functions (__global__), device functions (__device__), and host functions (__host__).
- CUDA Toolkit: Installation and setup of the NVIDIA CUDA development environment, including the nvcc compiler.
For this program, the necessary setup includes having a CUDA-capable NVIDIA GPU and the CUDA Toolkit installed on your system. No special imports beyond standard C libraries and cuda_runtime.h are required.
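To illustrate the function qualifiers mentioned above, here is a minimal sketch (the names scale, scaleKernel, and launchScale are made up for illustration and are not part of the matrix multiplication program):

#include <cuda_runtime.h>

// __device__: callable only from code running on the GPU
__device__ float scale(float x) { return 2.0f * x; }

// __global__: a kernel, launched from the host and executed on the device
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

// __host__ (the default for ordinary functions): runs on the CPU
__host__ void launchScale(float* d_data, int n) {
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
}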
Use Cases or Case Studies
Matrix multiplication is a cornerstone operation across various computational fields:
- Deep Learning: It forms the core of neural network computations, especially in convolutional layers and fully connected layers for training and inference.
- Image Processing: Operations like filtering, transformations, and convolutions often rely on matrix multiplications for efficient pixel manipulation.
- Scientific Simulations: Solving systems of linear equations, fluid dynamics, quantum mechanics, and finite element analysis frequently involve large-scale matrix operations.
- Computer Graphics: 3D transformations (scaling, rotation, translation), projection, and rendering pipelines extensively use matrix multiplications.
- Signal Processing: Digital filters, spectral analysis, and beamforming algorithms often leverage matrix multiplication for data transformation and analysis.
Solution Approaches
For matrix multiplication on a GPU, the most straightforward approach involves assigning each output element C[i][j] to a unique thread. This maximizes parallelism by allowing all elements of the result matrix to be computed concurrently.
1. Element-wise Thread Assignment (Naive CUDA)
This approach assigns one CUDA thread to calculate a single element of the output matrix C.
- One-line summary: Each thread computes one element of the result matrix C by iterating over the corresponding row of A and column of B.
- Code example:
// CUDA Matrix Multiplication
#include <stdio.h>
#include <stdlib.h>        // For malloc and free
#include <cuda_runtime.h>  // For CUDA API functions

// Define matrix dimensions
#define M 256
#define K 256
#define N 256

// CUDA error check macro
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file, const int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA Error at %s:%d: %s %s\n", file, line, func, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// CUDA kernel to perform matrix multiplication
__global__ void matrixMulKernel(float* A, float* B, float* C, int widthA, int heightA, int widthB, int heightB) {
    // Calculate the row and column of the C element to be computed by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Check if the thread is within the bounds of the output matrix C
    if (row < heightA && col < widthB) {
        float sum = 0.0f;
        for (int i = 0; i < widthA; ++i) { // widthA == heightB (the K dimension)
            sum += A[row * widthA + i] * B[i * widthB + col];
        }
        C[row * widthB + col] = sum;
    }
}

// Function to initialize a matrix with random values
void initializeMatrix(float* matrix, int rows, int cols) {
    for (int i = 0; i < rows * cols; ++i) {
        matrix[i] = rand() / (float)RAND_MAX; // Random float between 0 and 1
    }
}

// Function to print a matrix (for small matrices)
void printMatrix(float* matrix, int rows, int cols) {
    if (rows > 10 || cols > 10) { // Limit printing for large matrices
        printf("Matrix too large to print.\n");
        return;
    }
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            printf("%.2f ", matrix[i * cols + j]);
        }
        printf("\n");
    }
}

int main() {
    // Step 1: Allocate memory on host (CPU)
    float *h_A, *h_B, *h_C;
    size_t sizeA = M * K * sizeof(float);
    size_t sizeB = K * N * sizeof(float);
    size_t sizeC = M * N * sizeof(float);
    h_A = (float*)malloc(sizeA);
    h_B = (float*)malloc(sizeB);
    h_C = (float*)malloc(sizeC);
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Failed to allocate host memory!\n");
        exit(EXIT_FAILURE);
    }

    // Step 2: Initialize host matrices
    printf("Initializing host matrices...\n");
    initializeMatrix(h_A, M, K);
    initializeMatrix(h_B, K, N);

    // Optional: Print small matrices
    printf("Matrix A (%d x %d):\n", M, K);
    printMatrix(h_A, M, K);
    printf("Matrix B (%d x %d):\n", K, N);
    printMatrix(h_B, K, N);

    // Step 3: Allocate memory on device (GPU)
    float *d_A, *d_B, *d_C;
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_A, sizeA));
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_B, sizeB));
    CHECK_CUDA_ERROR(cudaMalloc((void**)&d_C, sizeC));

    // Step 4: Copy input matrices from host to device
    printf("Copying data from host to device...\n");
    CHECK_CUDA_ERROR(cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA_ERROR(cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice));

    // Step 5: Define grid and block dimensions for the kernel launch
    // A common choice is 16x16 or 32x32 threads per block
    int THREADS_PER_BLOCK_X = 16;
    int THREADS_PER_BLOCK_Y = 16;
    dim3 dimBlock(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y);
    dim3 dimGrid(
        (N + dimBlock.x - 1) / dimBlock.x, // Grid width (columns of C)
        (M + dimBlock.y - 1) / dimBlock.y  // Grid height (rows of C)
    );

    // Step 6: Launch the CUDA kernel
    printf("Launching CUDA kernel...\n");
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, K, M, N, K); // K is widthA and heightB
    CHECK_CUDA_ERROR(cudaGetLastError());      // Check for errors during kernel launch
    CHECK_CUDA_ERROR(cudaDeviceSynchronize()); // Wait for the kernel to finish and surface execution errors

    // Step 7: Copy the result matrix from device to host
    printf("Copying result from device to host...\n");
    CHECK_CUDA_ERROR(cudaMemcpy(h_C, d_C, sizeC, cudaMemcpyDeviceToHost));

    // Step 8: Print the result (optional for small matrices)
    printf("Result Matrix C (%d x %d):\n", M, N);
    printMatrix(h_C, M, N);

    // Step 9: Free device memory
    CHECK_CUDA_ERROR(cudaFree(d_A));
    CHECK_CUDA_ERROR(cudaFree(d_B));
    CHECK_CUDA_ERROR(cudaFree(d_C));

    // Step 10: Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Matrix multiplication completed and memory freed.\n");
    return 0;
}
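Assuming the program is saved as matmul.cu (the file name is arbitrary), it can be compiled and run with the nvcc compiler from the CUDA Toolkit:

nvcc matmul.cu -o matmul
./matmul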
- Sample output:
Initializing host matrices...
Matrix A (256 x 256):
Matrix too large to print.
Matrix B (256 x 256):
Matrix too large to print.
Copying data from host to device...
Launching CUDA kernel...
Copying result from device to host...
Result Matrix C (256 x 256):
Matrix too large to print.
Matrix multiplication completed and memory freed.
If you change M, K, and N to smaller values such as 2, printMatrix will show the actual values, for example:
Initializing host matrices...
Matrix A (2 x 2):
0.84 0.96
0.37 0.99
Matrix B (2 x 2):
0.04 0.98
0.73 0.28
Copying data from host to device...
Launching CUDA kernel...
Copying result from device to host...
Result Matrix C (2 x 2):
0.73 1.09
0.74 0.65
Matrix multiplication completed and memory freed.
(Values shown are illustrative; they depend on the rand() sequence. Since the program does not call srand(), the output is the same on every run.)
- Stepwise explanation:
  - Define Dimensions: The M, K, N constants define the dimensions of matrices A (M x K), B (K x N), and C (M x N).
  - Error Handling: A CHECK_CUDA_ERROR macro is defined for robust error checking of CUDA API calls.
  - Kernel Definition:
    - The matrixMulKernel function is declared with __global__, marking it as a CUDA kernel that runs on the GPU.
    - Each thread computes its position in the C matrix using blockIdx.y, blockDim.y, threadIdx.y for the row and blockIdx.x, blockDim.x, threadIdx.x for the column.
    - A bounds check ensures that row and col are within the valid matrix bounds (heightA for rows of C, widthB for columns of C).
    - The thread then computes its C[row][col] element by summing products from the corresponding row of A and column of B.
  - Host Code (main function):
    - Memory Allocation (Host): h_A, h_B, h_C are allocated on the CPU using malloc.
    - Initialization: h_A and h_B are filled with random floating-point values using initializeMatrix.
    - Memory Allocation (Device): d_A, d_B, d_C are allocated on the GPU using cudaMalloc.
    - Host-to-Device Copy: h_A and h_B are copied from host to device memory using cudaMemcpy(..., cudaMemcpyHostToDevice).
    - Launch Configuration: dim3 dimBlock defines the number of threads per block (e.g., 16x16), and dim3 dimGrid calculates the number of blocks needed to cover the entire output matrix C. The (N + dimBlock.x - 1) / dimBlock.x pattern ensures that enough blocks are launched even if the matrix dimensions are not evenly divisible by the block dimensions.
    - Kernel Launch: The matrixMulKernel<<<dimGrid, dimBlock>>>(...) syntax launches the kernel on the GPU; cudaGetLastError and cudaDeviceSynchronize catch launch and execution errors.
    - Device-to-Host Copy: The result d_C is copied back from device to host memory using cudaMemcpy(..., cudaMemcpyDeviceToHost).
    - Cleanup: cudaFree releases device memory, and free releases host memory.
Conclusion
Implementing matrix multiplication with CUDA significantly accelerates the computation compared to CPU-only solutions, especially for large matrices. By assigning each output element calculation to a separate thread, the GPU's massive parallel processing capabilities can be fully utilized. This fundamental parallelization pattern is a cornerstone for many high-performance computing tasks on GPUs.
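If you want to measure the kernel's runtime on your own hardware, CUDA events can be used for timing. The following is a minimal sketch, assuming it is placed inside main after Step 5 of the program above and reuses d_A, d_B, d_C, dimGrid, and dimBlock:

    // Time the kernel with CUDA events (sketch; reuses variables from main above)
    cudaEvent_t start, stop;
    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    CHECK_CUDA_ERROR(cudaEventRecord(start));
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, K, M, N, K);
    CHECK_CUDA_ERROR(cudaEventRecord(stop));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %.3f ms\n", ms);

    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));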
Summary
- Matrix multiplication is a computationally intensive operation with O(MKN) complexity (O(N^3) for square matrices) in naive CPU implementations.
- GPUs, with their parallel architecture, are well-suited for accelerating matrix multiplication.
- CUDA allows direct programming of NVIDIA GPUs using a C/C++ based language.
- A common CUDA approach assigns one thread per output matrix element.
- The process involves:
- Allocating memory on both host and device.
- Copying input data from host to device.
- Defining grid and block dimensions for kernel launch.
- Launching a __global__ kernel to perform the calculations.
- Copying the results back from device to host.
- Freeing allocated memory.
- This parallelization strategy is crucial for performance in deep learning, scientific computing, and graphics.