Spam classifier in Go using Naive Bayes


A Naive Bayes spam classifier implemented in Go: a small text classification library that uses the Naive Bayes algorithm with Laplace smoothing to label messages as spam or not spam.

  • Naive Bayes Classification: Uses probabilistic classification based on Bayes' theorem with naive independence assumptions
  • Laplace Smoothing: Implements additive smoothing to handle zero probabilities for unseen words
  • Training & Classification: Simple API for training on labeled datasets and classifying new messages
  • Real Dataset Testing: Includes tests with actual spam/ham email datasets
Install:

```shell
go get github.com/igomez10/nspammer
```
```go
package main

import (
	"fmt"

	"github.com/igomez10/nspammer"
)

func main() {
	// Create training dataset (map[string]bool where true = spam, false = not spam)
	trainingData := map[string]bool{
		"buy viagra now":        true,
		"get rich quick":        true,
		"meeting at 3pm":        false,
		"project update report": false,
	}

	// Create and train classifier
	classifier := nspammer.NewSpamClassifier(trainingData)

	// Classify new messages
	isSpam := classifier.Classify("buy now")
	fmt.Printf("Is spam: %v\n", isSpam)
}
```

NewSpamClassifier(dataset map[string]bool) *SpamClassifier

Creates a new spam classifier and trains it on the provided dataset. The dataset is a map where keys are text messages and values indicate whether the message is spam (true) or not spam (false).

(*SpamClassifier).Classify(input string) bool

Classifies the input text as spam (true) or not spam (false) based on the trained model.

The classifier uses the Naive Bayes algorithm:

  1. Training Phase:

    • Calculates prior probabilities: P(spam) and P(not spam)
    • Builds a vocabulary from all training messages
    • Counts word occurrences in spam and non-spam messages
    • Stores word frequencies for likelihood calculations
  2. Classification Phase:

    • Calculates log probabilities to avoid numerical underflow
    • Computes: log(P(spam)) + Σ log(P(word|spam))
    • Computes: log(P(not spam)) + Σ log(P(word|not spam))
    • Returns true (spam) if the spam score is higher
  3. Laplace Smoothing:

    • Adds a smoothing constant to avoid zero probabilities for unseen words
    • Formula: P(word|class) = (count + α) / (total + α × vocabulary_size)
    • Default α = 1.0
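The phases above can be sketched in plain Go. This is a minimal re-implementation for illustration only; the actual nspammer internals, type names, and tokenization may differ:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// classifier holds the statistics computed during the training phase.
type classifier struct {
	priorSpam, priorHam   float64             // log prior probabilities
	spamCounts, hamCounts map[string]int      // per-class word counts
	spamTotal, hamTotal   int                 // total words per class
	vocab                 map[string]struct{} // vocabulary across all messages
	alpha                 float64             // Laplace smoothing constant
}

// train builds priors, vocabulary, and word counts from labeled messages.
func train(data map[string]bool) *classifier {
	c := &classifier{
		spamCounts: map[string]int{},
		hamCounts:  map[string]int{},
		vocab:      map[string]struct{}{},
		alpha:      1.0,
	}
	spamMsgs, hamMsgs := 0, 0
	for msg, isSpam := range data {
		for _, w := range strings.Fields(strings.ToLower(msg)) {
			c.vocab[w] = struct{}{}
			if isSpam {
				c.spamCounts[w]++
				c.spamTotal++
			} else {
				c.hamCounts[w]++
				c.hamTotal++
			}
		}
		if isSpam {
			spamMsgs++
		} else {
			hamMsgs++
		}
	}
	total := float64(spamMsgs + hamMsgs)
	c.priorSpam = math.Log(float64(spamMsgs) / total)
	c.priorHam = math.Log(float64(hamMsgs) / total)
	return c
}

// classify sums log priors and Laplace-smoothed log likelihoods,
// returning true when the spam score is higher.
func (c *classifier) classify(input string) bool {
	v := float64(len(c.vocab))
	spamScore, hamScore := c.priorSpam, c.priorHam
	for _, w := range strings.Fields(strings.ToLower(input)) {
		// P(word|class) = (count + α) / (total + α × vocabulary_size)
		spamScore += math.Log((float64(c.spamCounts[w]) + c.alpha) / (float64(c.spamTotal) + c.alpha*v))
		hamScore += math.Log((float64(c.hamCounts[w]) + c.alpha) / (float64(c.hamTotal) + c.alpha*v))
	}
	return spamScore > hamScore
}

func main() {
	c := train(map[string]bool{
		"buy viagra now":        true,
		"get rich quick":        true,
		"meeting at 3pm":        false,
		"project update report": false,
	})
	fmt.Println(c.classify("buy now"))        // true
	fmt.Println(c.classify("meeting at 3pm")) // false
}
```

Working in log space, as the sketch does, keeps the product of many small likelihoods from underflowing to zero.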

The project includes support for the Kaggle Spam Mails Dataset. To download it:

This script requires the Kaggle CLI to be installed and configured.
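With the Kaggle CLI configured, the download typically follows this pattern (the dataset slug below is a placeholder, not taken from the project):

```shell
# Download and unpack a Kaggle dataset into the current directory.
# Replace <owner/dataset-slug> with the actual Spam Mails Dataset slug.
kaggle datasets download -d <owner/dataset-slug>
unzip '*.zip'
```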

Run the test suite:
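For a Go module, the standard invocation (run from the repository root) is:

```shell
# Run all tests in the module
go test ./...

# Verbose output with coverage summary
go test -v -cover ./...
```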

The tests include:

  • Simple classification examples
  • Real-world email dataset evaluation
  • Accuracy measurements on train/test splits