A fully from-scratch Multi-Layer Perceptron built in CUDA C++ with support for both GPU and CPU training. It features a clean, modular API for defining network architectures, loss functions, and activation functions without relying on external machine learning libraries. Whether you're experimenting on a CPU or training faster on a GPU, this dual-backend system lets you easily switch between the two, making it ideal for both educational purposes and custom, low-level deep learning research.
Built by a teenager with a deep passion for AI systems and systems-level programming
- Dual Backend: Train your models on either CPU or GPU (CUDA) with a simple switch
- Modular & Clean API: Easy to define and train models without any external dependencies
- Loss Functions: Mean Squared Error (MSE), Cross Entropy (CE), and Binary Cross Entropy (BCE)
- Activation Functions: Sigmoid, ReLU, Leaky ReLU, Tanh, and Linear
- Optimizers: vanilla Stochastic Gradient Descent (SGD), Mini-Batch SGD, and Momentum
- Weight Initialization Techniques: Xavier Normal, Xavier Uniform, He Normal, and He Uniform
- Model Saving And Loading Mechanism
- Fully Customizable: Choose batch size, learning rate, architecture, backend, and more
- Eigen Library
- CUDA Toolkit 7.5 or later
- A CUDA-capable NVIDIA GPU
- Open the cloned repo and compile & run the project
If you're using Windows (e.g., with Git Bash or PowerShell), make sure nvcc is in your system PATH
To load data from a CSV file, use the loadcsv.cuh header file. First include the header, then call the load_csv_eigen() function to load the data.
load_csv_eigen(const std::string& filename, const std::string& target_column, float training_ratio = 0.8f)
This function returns a struct containing the training features (X_train), training labels (Y_train), testing features (X_test), and testing labels (Y_test)
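A minimal usage sketch, assuming the signature above (the CSV filename and target column name here are placeholders for illustration):

```cpp
#include "loadcsv.cuh"

int main() {
    // "data.csv" and "label" are placeholder names for illustration
    auto data = load_csv_eigen("data.csv", "label", 0.8f);
    // data.X_train / data.Y_train hold the training split (80% of rows),
    // data.X_test  / data.Y_test  hold the remaining 20%
}
```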
You can also normalize the data using the normalizeMatrix() function. It takes a reference to an EigenMatrix and normalizes it in place
normalizeMatrix(EigenMatrix& matrix)
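For illustration, here is the general in-place pattern on plain std::vector types rather than Eigen. Note this is a sketch only: the helper name normalizeColumns is hypothetical, and whether normalizeMatrix uses column-wise min-max scaling (shown here) or another scheme such as z-score is an assumption.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Hypothetical illustration: scale each column to the [0, 1] range in place.
void normalizeColumns(std::vector<std::vector<float>>& m) {
    if (m.empty()) return;
    const std::size_t cols = m[0].size();
    for (std::size_t c = 0; c < cols; ++c) {
        float lo = m[0][c], hi = m[0][c];
        for (const auto& row : m) {            // find the column's min and max
            lo = std::min(lo, row[c]);
            hi = std::max(hi, row[c]);
        }
        const float range = (hi - lo) != 0.0f ? (hi - lo) : 1.0f;  // avoid /0
        for (auto& row : m) row[c] = (row[c] - lo) / range;
    }
}
```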
To one-hot-encode the labels, use the toOneHot() function. It takes the labels and the number of classes as parameters and returns an EigenMatrix containing the one-hot-encoded labels (for multiclass classification)
EigenMatrix toOneHot(EigenVector& labels, int num_labels)
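For illustration, this is what one-hot encoding does, sketched on plain std::vector types rather than Eigen (the helper name oneHot is hypothetical; the library's own function is toOneHot above):

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch: one row per sample, 1.0f at the class index, 0.0f elsewhere.
std::vector<std::vector<float>> oneHot(const std::vector<int>& labels, int num_classes) {
    std::vector<std::vector<float>> out(labels.size(),
                                        std::vector<float>(num_classes, 0.0f));
    for (std::size_t i = 0; i < labels.size(); ++i)
        out[i][labels[i]] = 1.0f;
    return out;
}
```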
To build a model architecture, first include the NeuralNetwork.cuh header file and initialize a NeuralNetwork class object
First, define the size of the input layer (the number of columns in the training features) using the input() member function
void NeuralNetwork::input(int size)
Then, to add hidden layers or the output layer, use the extend() member function
void NeuralNetwork::extend(int neurons, const std::string& activation_function, const Initializer& initializer)
- Supported activation functions - "sigmoid", "relu", "tanh", "softmax", and "linear"
- Supported weight initializers - He_Uniform, He_Normal, Xavier_Uniform, Xavier_Normal
To configure the learning rate, optimizer, loss function, batch size, and verbosity, use the assemble() member function
void NeuralNetwork::assemble(const std::string& Loss_function, ElementType Learning_rate, int Batch_size, ElementType Momentum_coef=0.0f, bool Verbose=true)
- Supported loss functions - "MSE", "cross_entropy", and "binary_cross_entropy"
- Supported optimizers - SGD [default] (keep Momentum_coef = 0.0f), Momentum (set Momentum_coef > 0.0)
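Putting the calls above together, a minimal model definition might look like the following sketch. The layer sizes and hyperparameter values are illustrative only, and the exact spelling of the initializer constants follows the names listed earlier:

```cpp
#include "NeuralNetwork.cuh"

int main() {
    NeuralNetwork nn;
    nn.input(784);                              // input layer: one unit per feature column
    nn.extend(128, "relu", He_Normal);          // hidden layer
    nn.extend(10, "softmax", Xavier_Uniform);   // output layer
    // cross-entropy loss, learning rate 0.01, batch size 32, momentum 0.9, verbose
    nn.assemble("cross_entropy", 0.01f, 32, 0.9f, true);
}
```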
To start training, use the learn() member function; to make predictions on data, use the predict() member function
void NeuralNetwork::learn(EigenMatrix& X_train, EigenMatrix& Y_train, int epochs, const TrainingDevice& device, bool enableShuffling)
EigenMatrix NeuralNetwork::predict(const EigenMatrix& to_predict)
- Training Devices - CPU and GPU
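Continuing the sketch above, training and prediction might look like this (the exact spelling of the device value, e.g. GPU vs. TrainingDevice::GPU, is an assumption):

```cpp
// nn, X_train, Y_train, and X_test are assumed to exist from the previous steps
nn.learn(X_train, Y_train, /*epochs=*/20, GPU, /*enableShuffling=*/true);

EigenMatrix predictions = nn.predict(X_test);  // one row of outputs per test sample
```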
Use info() member function to print information about the Network's Architecture
Output:
To save the model after training, use the exportModel() member function; to import an exported model, use the importModel() member function
void NeuralNetwork::exportModel(const std::string& filename)
void NeuralNetwork::importModel(const std::string& filename)
After importing the model, you can do predictions or further training, or fine-tuning on another dataset (transfer learning)
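A hedged save/restore sketch based on the signatures above (the filename is a placeholder, and X_new/Y_new stand in for any new dataset used for fine-tuning):

```cpp
nn.exportModel("model.bin");                 // save the trained model to a placeholder filename

NeuralNetwork restored;
restored.importModel("model.bin");           // restore it later or in another program
EigenMatrix out = restored.predict(X_test);  // predict right away,
restored.learn(X_new, Y_new, 5, CPU, true);  // or fine-tune on a new dataset
```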
To load the MNIST dataset, use the load_mnist.cuh header file
Image of a few predictions and accuracy

- Unoptimized CUDA kernels: The current GPU implementation prioritizes clarity, modularity, and reusability over low-level performance optimizations. It does not use shared memory, tiling strategies, or fused kernels, but these can be added later for performance-critical use cases.
- Lack of training metrics: The model currently reports only the loss after each epoch. It does not track or display per-epoch accuracy, validation performance, or training time/speed metrics.
- Only a few activation and loss functions are available.