The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification. It uses principles of probability to perform classification.
Naïve Bayes is part of a family of generative learning algorithms, meaning that it seeks to model the distribution of inputs for a given class or category. Unlike discriminative classifiers, such as logistic regression, it does not learn which features are most important for differentiating between classes.
Naïve Bayes is also known as a probabilistic classifier since it is based on Bayes’ Theorem. It would be difficult to explain this algorithm without explaining the basics of Bayesian statistics. This theorem, also known as Bayes’ Rule, allows us to “invert” conditional probabilities. As a reminder, conditional probabilities represent the probability of an event given some other event has occurred, which is represented with the following formula:
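In standard notation, for two events A and B with P(B) > 0:

P(A | B) = P(A ∩ B) / P(B)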
Bayes’ Theorem is distinguished by its use of sequential events, where additional information acquired later impacts the initial probability. These probabilities are denoted as the prior probability and the posterior probability. The prior probability is the initial probability of an event before it is contextualized under a certain condition; it is also called the marginal probability. The posterior probability is the probability of the event after observing a piece of data.
A popular example in statistics and machine learning literature to demonstrate this concept is medical testing. For instance, imagine there is an individual, named Jane, who takes a test to determine if she has diabetes. Let’s say that the overall probability of having diabetes is 5%; this would be our prior probability. However, if she obtains a positive result from her test, the prior probability is updated to account for this additional information, and it then becomes our posterior probability. This example can be represented with the following equation, using Bayes’ Theorem:
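Writing the condition and the test result as events, this reads:

P(diabetes | positive test) = [ P(positive test | diabetes) × P(diabetes) ] / P(positive test)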
However, since our knowledge of the prior probability is not likely to be exact given other variables, such as diet, age and family history, we typically leverage probability distributions from random samples, simplifying the equation to P(Y|X) = P(X|Y)P(Y) / P(X).
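To make the arithmetic concrete, here is a minimal sketch of the diabetes example in Python. The 5% prior comes from the scenario above; the 90% sensitivity and 10% false-positive rate are assumed values chosen purely for illustration.

```python
# Bayes' Theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
# Y = "Jane has diabetes", X = "Jane's test is positive"

p_diabetes = 0.05             # prior P(Y), from the example above
p_pos_given_diabetes = 0.90   # assumed sensitivity P(X|Y), for illustration only
p_pos_given_healthy = 0.10    # assumed false-positive rate P(X|not Y), for illustration only

# Total probability of a positive test, P(X), via the law of total probability
p_pos = p_pos_given_diabetes * p_diabetes + p_pos_given_healthy * (1 - p_diabetes)

# Posterior P(Y|X): the updated probability after seeing the positive result
p_diabetes_given_pos = p_pos_given_diabetes * p_diabetes / p_pos
print(f"P(diabetes | positive test) = {p_diabetes_given_pos:.3f}")  # ~0.321
```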
Naïve Bayes classifiers work differently in that they operate under a couple of key assumptions, earning them the title of “naïve”. A Naïve Bayes model assumes that the predictors are conditionally independent, that is, unrelated to any of the other features in the model. It also assumes that all features contribute equally to the outcome. While these assumptions are often violated in real-world scenarios (e.g. a subsequent word in an e-mail is dependent upon the word that precedes it), they simplify the classification problem by making it more computationally tractable. That is, only a single conditional probability is now required for each variable, which, in turn, makes the model computation easier. Despite this unrealistic independence assumption, the classification algorithm performs well, particularly with small sample sizes.
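Formally, with class y and features x1 through xn, the conditional independence assumption lets the joint likelihood factor into a product of per-feature likelihoods:

P(x1, x2, ..., xn | y) = P(x1 | y) × P(x2 | y) × ... × P(xn | y)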
With that assumption in mind, we can now reexamine the parts of a Naïve Bayes classifier more closely. Similar to Bayes’ Theorem, it’ll use conditional and prior probabilities to calculate the posterior probabilities using the following formula:
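With y denoting a class label and x1 through xn the features:

P(y | x1, ..., xn) = [ P(y) × P(x1 | y) × P(x2 | y) × ... × P(xn | y) ] / P(x1, ..., xn)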
Now, let’s imagine a text classification use case to illustrate how the Naïve Bayes algorithm works. Picture an e-mail provider that is looking to improve its spam filter. The training data would consist of words from e-mails that have been classified as either “spam” or “not spam”. From there, the class-conditional probabilities and the prior probabilities are calculated to yield the posterior probability. The Naïve Bayes classifier operates by returning the class that has the maximum posterior probability out of a group of classes (i.e. “spam” or “not spam”) for a given e-mail. This calculation is represented with the following formula:
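With ŷ denoting the predicted class and y ranging over {“spam”, “not spam”}:

ŷ = argmax over y of [ P(y) × P(x1 | y) × ... × P(xn | y) ] / P(x1, ..., xn)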
Since each class is referring to the same piece of text, we can actually eliminate the denominator from this equation, simplifying it to:
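ŷ = argmax over y of P(y) × P(x1 | y) × ... × P(xn | y)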
The accuracy of the learning algorithm trained on the training dataset is then evaluated based on its performance on the test dataset.
To unpack this a little more, we’ll go a level deeper into the individual parts that comprise this formula. The class-conditional probabilities are the individual likelihoods of each word in an e-mail. These are calculated by determining the frequency of each word for each category (i.e. “spam” or “not spam”), which is also known as maximum likelihood estimation (MLE). In this example, if we were examining the phrase “Dear Sir”, we’d just calculate how often those words occur within all spam and non-spam e-mails. This can be represented by the formula below, where x is a word from the phrase (e.g. “Dear”) and y is the class “spam”.
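Under this bag-of-words view, each word gets its own frequency-based estimate (so “Dear” and “Sir” are each counted and their likelihoods multiplied):

P(x | y) = (number of times word x appears in e-mails of class y) / (total number of words in e-mails of class y)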
The prior probabilities are exactly what we described earlier with Bayes’ Theorem. Based on the training set, we can calculate the overall probability that an e-mail is “spam” or “not spam”. The prior probability for the class label “spam” would be represented with the following formula:
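P(spam) = (number of “spam” e-mails in the training set) / (total number of e-mails in the training set)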
The prior probability acts as a “weight” on the class-conditional probability when the two values are multiplied together, yielding the individual posterior probabilities. From there, the maximum a posteriori (MAP) estimate is calculated to assign a class label of either spam or not spam. The final Naïve Bayes equation can be represented in the following ways:
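ŷ = argmax over y of P(y) × P(x1 | y) × P(x2 | y) × ... × P(xn | y)

Equivalently, for the two-class spam filter: label an e-mail “spam” when P(spam) × P(x1 | spam) × ... × P(xn | spam) exceeds the corresponding product for “not spam”, and “not spam” otherwise.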
Alternatively, it can be represented in log space, as Naïve Bayes is commonly used in this form:
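ŷ = argmax over y of [ log P(y) + log P(x1 | y) + ... + log P(xn | y) ]

Summing log-probabilities avoids the numerical underflow that results from multiplying many small probabilities together.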
One way to evaluate your classifier is to plot a confusion matrix, which lays out the actual and predicted values within a grid. Rows generally represent the actual values while columns represent the predicted values. Many guides illustrate this figure as a 2 x 2 plot, such as the one below:
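As a minimal sketch, assuming scikit-learn is installed, the matrix below is built from a handful of made-up spam/not-spam labels purely to show the layout (rows are actual values, columns are predicted values):

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels, for illustration only
actual    = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "not spam", "spam", "spam"]

# Rows correspond to actual classes, columns to predicted classes
matrix = confusion_matrix(actual, predicted, labels=["spam", "not spam"])
print(matrix)
# [[2 1]
#  [1 2]]
```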
However, if you were classifying images of the digits 0 through 9, you’d have a 10 x 10 plot. If you wanted to know the number of times the classifier “confused” images of 4s with 9s, you’d only need to check the row for 4 and the column for 9.
There isn’t just one type of Naïve Bayes classifier. The most popular types differ based on the distributions of the feature values; common variants include Gaussian, multinomial and Bernoulli Naïve Bayes. All of these can be implemented through the scikit-learn Python library (also known as sklearn), as sketched below.
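As a brief sketch of how those variants are instantiated in scikit-learn (the tiny word-count dataset here is made up purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Tiny made-up dataset: counts of three hypothetical words in four e-mails, with labels
X = np.array([[3, 0, 1],
              [0, 2, 0],
              [4, 1, 0],
              [0, 3, 1]])
y = np.array(["spam", "not spam", "spam", "not spam"])

# MultinomialNB suits count features, BernoulliNB binary features,
# GaussianNB continuous features; all share the same fit/predict API.
for model in (MultinomialNB(), BernoulliNB(), GaussianNB()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[2, 0, 1]]))
```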
Along with a number of other algorithms, Naïve Bayes belongs to a family of data mining algorithms that turn large volumes of data into useful information. Some applications of Naïve Bayes include spam filtering, document classification and sentiment analysis.