Show HN: I Am 15 and Built a Dual Backend MLP from Scratch Using CUDA C++


A fully from-scratch Multi-Layer Perceptron built in CUDA C++ with support for both GPU and CPU training. It features a clean, modular API for defining network architectures, loss functions, and activation functions without relying on external machine learning libraries. Whether you're experimenting on a CPU or training faster on a GPU, this dual-backend system lets you switch easily between the two, making it ideal for both educational purposes and custom, low-level deep learning research.

Built by a teenager with a deep passion for AI systems and systems-level programming.

  • Dual Backend: Train your models on either CPU or GPU (CUDA) with a simple switch
  • Modular & Clean API: Easy to define and train models without any external dependencies
  • Loss Functions: Mean Squared Error (MSE), Cross Entropy (CE), and Binary Cross Entropy (BCE)
  • Activation Functions: Sigmoid, ReLU, Leaky ReLU, Tanh, and Linear
  • Optimizers: vanilla Stochastic Gradient Descent (SGD), Mini-Batch SGD, and Momentum
  • Weight Initialization Techniques: Xavier Normal, Xavier Uniform, He Normal, and He Uniform
  • Model Saving And Loading Mechanism
  • Fully Customizable: Choose batch size, learning rate, architecture, backend, and more

Requirements

  • Eigen library
  • CUDA Toolkit 7.5 or later
  • A CUDA-capable NVIDIA GPU
Installation

git clone https://github.com/muchlakshay/Dual-Backend-MLP-From-Scratch-CUDA.git
cd Dual-Backend-MLP-From-Scratch-CUDA
Then compile and run the project.

If you're using Windows (e.g., with Git Bash or PowerShell), make sure nvcc is in your system PATH.
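To check, you can run:

nvcc --version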

nvcc main.cu kernels.cu matrix.cu -o main.exe
./main.exe
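Eigen is header-only, so if it is not on your compiler's default include path, you can point nvcc at it with -I (the path below is a placeholder):

nvcc -I/path/to/eigen main.cu kernels.cu matrix.cu -o main.exe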

Loading Data (From CSV Files)

To load data from a CSV file, you can use the loadcsv.cuh header file. First include the header, then call the load_csv_eigen() function to load the data.

EigenDataT load_csv_eigen(const std::string& filename, const std::string& target_column, float training_ratio = 0.8f)

#include "loadcsv.cuh"

EigenDataT data {load_csv_eigen("data.csv", "target_column", 0.7)};
std::cout << "Training Features\n" << data.X_train << "\n";
std::cout << "Training Labels\n" << data.Y_train << "\n";
std::cout << "Testing Features\n" << data.X_test << "\n";
std::cout << "Testing Labels\n" << data.Y_test << "\n";

This function returns a struct that contains the training features (X_train), training labels (Y_train), testing features (X_test), and testing labels (Y_test).
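As a rough sketch, the returned struct is shaped like the following (the field names come from the description above; the exact member types are defined in loadcsv.cuh and may differ, e.g. the label fields could be vectors rather than single-column matrices):

struct EigenDataT {
    EigenMatrix X_train; //training features
    EigenMatrix Y_train; //training labels
    EigenMatrix X_test;  //testing features
    EigenMatrix Y_test;  //testing labels
};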

You can also normalize the data using the normalizeMatrix() function. It takes a reference to an EigenMatrix and performs in-place normalization.

void normalizeMatrix(EigenMatrix& matrix)

normalizeMatrix(data.X_train);
std::cout << "Normalized Training Features\n" << data.X_train << "\n";

To one-hot encode the labels, you can use the toOneHot() function. It takes the labels and the number of classes as parameters and returns an EigenMatrix containing the one-hot encoded labels (for multiclass classification).

EigenMatrix toOneHot(EigenVector& labels, int num_labels)

EigenMatrix Y_train_ohe { toOneHot(data.Y_train, 10) }; //10 = number of classes in your dataset
EigenMatrix Y_test_ohe { toOneHot(data.Y_test, 10) };
std::cout << "One Hot Encoded Training Labels\n" << Y_train_ohe << "\n";
std::cout << "One Hot Encoded Testing Labels\n" << Y_test_ohe << "\n";
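For instance (illustrative values, not library output), with num_labels = 3 the labels {0, 2, 1} would encode to one row per sample, with a 1 in the column of that sample's class:

//label 0 -> 1 0 0
//label 2 -> 0 0 1
//label 1 -> 0 1 0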

To build a model architecture, first include the NeuralNetwork.cuh header file and create a NeuralNetwork object.

#include "NeuralNetwork.cuh" NeuralNetwork nn;

Now define the size of the input layer (the number of columns in the training features) using the input() member function.

void NeuralNetwork::input(int size)

nn.input(data.X_train.cols());

Then, to add hidden layers or the output layer, use the extend() member function.

void NeuralNetwork::extend(int neurons, const std::string& activation_function, const Initializer& initializer)

  • Supported activation functions - "sigmoid", "relu", "leaky_relu", "tanh", "softmax", and "linear"
  • Supported weight initializers - He_Uniform, He_Normal, Xavier_Uniform, Xavier_Normal

//example
nn.extend(16, "relu", NeuralNetwork::Initializer::Xavier_Uniform);

To configure the learning rate, optimizer, loss function, batch size, and verbosity, use the assemble() member function.

void NeuralNetwork::assemble(const std::string& Loss_function, ElementType Learning_rate, int Batch_size, ElementType Momentum_coef=0.0f, bool Verbose=true)

  • Supported loss functions - "MSE", "cross_entropy", and "binary_cross_entropy"
  • Supported optimizers - SGD [Default] (keep Momentum_coef = 0.0f), Momentum (set Momentum_coef > 0.0f)

nn.assemble("cross_entropy", 0.01f, 128, 0.95f, true);
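For reference, Momentum_coef plays the role of the coefficient in the textbook momentum update; here is a minimal sketch of a single-parameter step under that standard formulation (not necessarily the library's exact code):

//textbook momentum-SGD step for one parameter (illustrative only)
inline void momentum_step(float& w, float& v, float grad, float lr, float beta) {
    v = beta * v - lr * grad; //accumulate an exponentially decaying velocity
    w += v;                   //apply the update
}

With beta = 0.0f this reduces to a plain SGD step, matching the note above about keeping Momentum_coef = 0.0f for vanilla SGD.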

To start training, use the learn() member function; to make predictions on data, use the predict() member function.

void NeuralNetwork::learn(EigenMatrix& X_train, EigenMatrix& Y_train, int epochs, const TrainingDevice& device, bool enableShuffling)

EigenMatrix NeuralNetwork::predict(const EigenMatrix& to_predict)

  • Training Devices - CPU and GPU

//for CPU training
nn.learn(data.X_train, Y_train_ohe, 100, NeuralNetwork::TrainingDevice::CPU, false);

//for GPU training
nn.learn(data.X_train, Y_train_ohe, 100, NeuralNetwork::TrainingDevice::GPU, false);

//making predictions
auto predictions { nn.predict(data.X_test) };

Use the info() member function to print information about the network's architecture.

NeuralNetwork nn;
nn.input(10);
nn.extend(16, "relu", NeuralNetwork::Initializer::Xavier_Uniform);
nn.extend(5, "softmax", NeuralNetwork::Initializer::Xavier_Uniform);
nn.info();

Output:

Layer 1: (Input Layer)
Neurons: 10

Layer: 2
Neurons: 16
Activation: relu
Weights: 16 x 10 (160)
Biases : 16

Layer: 3
Neurons: 5
Activation: softmax
Weights: 5 x 16 (80)
Biases : 5

Total Weights: 240
Total Biases: 21
Total Parameters: 261

To save the model after training, use the exportModel() member function; to import an exported model, use the importModel() member function.

void NeuralNetwork::exportModel(const std::string& filename)
void NeuralNetwork::importModel(const std::string& filename)

//exporting a model
nn.exportModel("model.txt");

//importing a model
NeuralNetwork nn2;
nn2.importModel("model.txt");

After importing a model, you can make predictions, continue training, or fine-tune it on another dataset (transfer learning).
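For example, a resumed-training workflow might look like the sketch below, built from the calls documented above ("model.txt" and "new_data.csv" are placeholder names; depending on how importModel restores the training configuration, you may need to call assemble() again before learn()):

//importing a previously exported model
NeuralNetwork nn;
nn.importModel("model.txt");

//fine-tuning on a new dataset
EigenDataT new_data {load_csv_eigen("new_data.csv", "target_column", 0.8)};
normalizeMatrix(new_data.X_train);
nn.learn(new_data.X_train, new_data.Y_train, 10, NeuralNetwork::TrainingDevice::GPU, true);

//or just predicting
auto predictions { nn.predict(new_data.X_test) };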

#include "NeuralNetwork.cuh" #include "loadcsv.cuh" int main() { //Data Loading EigenDataT data { EigenDataT data {load_csv_eigen("data.csv", "target_column", 0.8)}; }; //Normalizing Data normalizeMatrix(data.X_train); normalizeMatrix(data.X_test); //Model Building NeuralNetwork nn; nn.input(data.X_train.cols()); nn.extend(16, "leaky_relu", NeuralNetwork::Initializer::He_Normal); nn.extend(4, "leaky_relu", NeuralNetwork::Initializer::He_Normal); nn.extend(1, "sigmoid", NeuralNetwork::Initializer::He_Normal); nn.assemble("binary_cross_entropy", 0.001f, 64, 0.9, true); nn.learn(data.X_train, data.Y_train, 100, NeuralNetwork::TrainingDevice::GPU, true); //predictions auto predictions { nn.predict(data.X_test) }; //exporting trained model nn.exportModel("model.txt"); return 0; }

Example Training On MNIST Dataset

For loading the MNIST dataset, the load_mnist.cuh header file is used.

#include "NeuralNetwork.cuh" #include "loadcsv.cuh" #include "load_mnist.cuh" int main() { //struct to load MNIST MNISTData mnist; //importing MNIST try { mnist = load_mnist("train-images.idx3-ubyte", "train-labels.idx1-ubyte", "t10k-images.idx3-ubyte", "t10k-labels.idx1-ubyte"); } catch (const std::exception& e) { std::cout << e.what(); } //Building Model NeuralNetwork nn; nn.input(mnist.X_train.cols()); nn.extend(16, "tanh", NeuralNetwork::Initializer::Xavier_Uniform); nn.extend(16, "tanh", NeuralNetwork::Initializer::Xavier_Uniform); nn.extend(10, "softmax", NeuralNetwork::Initializer::Xavier_Uniform); nn.assemble( "cross_entropy", 0.01f, 1000, 0.95f); //Starting Training nn.learn(mnist.X_train, mnist.Y_train, 20, NeuralNetwork::TrainingDevice::GPU, false); nn.exportModel("model.txt"); //Recognizing test digits auto predicted{ nn.predict(mnist.X_test) }; //printing predictions for (int i{}; i < predicted.rows(); ++i) { std::cout << "Actual: " << mnist.Y_test.row(i) << " Predicted: " << predicted.row(i) << "\n"; } //Calculation accuracy std::cout << "Accuracy: " << calculateAccuracy(predicted, mnist.Y_test) << "%"; //exporting trained model (for future predictions or fine-tuning parameters more) nn.exportModel("model_mnist.txt"); return 0; }

[Image: sample predictions and accuracy output]

LIMITATIONS / KNOWN ISSUES

  • Unoptimized CUDA Kernels: The current GPU implementation prioritizes clarity, modularity, and reusability over low-level performance optimizations. It does not use shared memory, tiling strategies, or fused kernels, but these can be added later for performance-critical use cases (see the sketch after this list).
  • Lack of Training Metrics: The model currently reports only the loss after each epoch. It does not track or display per-epoch accuracy, validation performance, or training time/speed metrics.
  • Only a few activation and loss functions are available.
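As an illustration of what the first bullet refers to, a tiled matrix-multiply kernel stages blocks of the operands in shared memory so each global value is loaded once per tile rather than once per thread. The sketch below is a generic textbook example (C = A * B for row-major MxK and KxN matrices), not code from this repository:

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        //stage one tile of A and one tile of B into shared memory
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        //each thread accumulates a partial dot product from the staged tiles
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}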