7  Naive Bayes

7.1 Overview

The Naive Bayes method of classification provides a fairly simple and accessible framework under which to estimate the probability that a subject is a member of a certain class, given prior evidence within one’s dataset. The methodology is commonly used in recommendation systems (e.g. streaming services, online gaming services, e-commerce, and many more) to quickly identify and suggest new actions, items, or activity to a user based on their past actions or activity.
The simplicity of the algorithm makes it an excellent candidate for assessing categorical data (in the case of multinomial Naive Bayes), as well as normally-distributed numerical data (for Gaussian Naive Bayes), and Bernoulli distributed or binary data (as for Bernoulli Naive Bayes). The assumptions made in the algorithm render it simple to implement, and to produce models with a reliable degree of performance.

7.1.1 What makes it Naive?

The following equations outline different formulations of Bayes’ Rule. In these formulations, let X represent the Data, and Y represent the Outcome or Label for the data.

\[ P(X|Y) = \frac{P(Y|X)\cdot P(X)}{P(Y)} \tag{7.1}\]

\[ P(X|Y) = \frac{P(X\cap Y)}{P(Y)} \tag{7.2}\]

\[ P(Y|X) = \frac{P(X|Y)\cdot P(X)}{P(Y)} \tag{7.3}\]

\[ P(Y|X) = \frac{P(X\cap Y)}{P(X)} \tag{7.4}\]

Bayes’ rule has no stipulation that X and Y be independent from one another in the source data. The requirement is that the input probability for \(P(X)\), \(P(Y)\), and \(P(X\cap Y)\) be the actual probability for each variable or combination thereof.

The algorithm is Naive because it assumes all features or input variables are independent of one another, or that each features’ probability does not directly impact one another. In this assumption, it means one can leverage the probability rule for independence between two variables:

\[ P(X\cap Y) = P(X)\cdot P(Y)\iff \text{X,Y are independent} \tag{7.5}\]

In the world of statistics, it can be quite rare to encounter a collected dataset in which all features are truly independent of one another.

Many novice, introductory statistics students sometimes make the assumption that variables are independent while working out problems in their homeworks, quizzes, or exams. This leads to incorrect responses, but makes the problem doable in relatively short order. In doing so, those students demonstrate their naivety. Since it makes the same assumption and performs no work to confirm or refute the claim, Naive Bayes is similarly naive.

This allows for the transformation of Equation 7.3, Equation 7.4, and Equation 7.5 to calculate probabilites when fitting data to a model and predicting the class of a new record:

\[ P(Y|X) = \frac{P(X_1|Y)\cdot P(X_2|Y)\cdot ...\cdot P(X_n|Y)}{P(Y)} \tag{7.6}\]

\[ P(Y|X) = \frac{P(X_1\cap Y)\cdot P(X_2 \cap Y)\cdot ...\cdot P(X_n\cap Y)}{P(Y)} \tag{7.7}\]

\[ P(Y|X) = \frac{\prod\limits_{i=1}^n P(X_i\cap Y)}{P(X)} \tag{7.8}\]

In Equation 7.8, one can see that the probability of belonging to a certain class, given an input record, is the product of the probabilities of each individual feature holding the same category for \(X_j\) when the class is \(Y\).

Because of the product formulation in Equation 7.8 above, it can be benefical to represent the equation using logarithms, rendering a simpler and more easily calculated implementation:

\[ \text{log}(P(Y|X)) = \frac{\text{log}\sum\limits_{i=1}^n P(X_i\cap Y)}{P(X)} \tag{7.9}\]

7.1.2 How does it work?

In short, prior probabilities are calculated as follows from a training dataset:

\[ P(X|Y) = \frac{P(X\cap Y)}{P(Y)} \]

The above is calculated for all features \(X_i\) in \(X\).

From there, these probabilities are applied in Equation 7.8 for every possible class outcome \(y_i\). The maximum amongst these probabilites is selected as the predicted class.

A simple way to understand Naive Bayes is as a calculation of the combined relative frequency for all features or variables in consideration when records belong to a specific class. These relative frequencies are treated as overall probabities, and are pre-calculated for a Naive Bayes model using a training dataset. When new records are introduced to the model for classification, Naive Bayes outputs the class that has the greatest probability using the pre-computed prior probabilites in conjunction with the new records’ feature data.

Consider the below example:

For this example, a dummy dataset is generated with 400 records. There are 4 classifications - R, G, Y, and B. There are 3 features for evaluation:

  • size, which takes on possible values of “S”, “M”, or “L”

  • cost, which takes on possible values of “free”,“low”,‘medium’, or ‘high’

  • tested, which is a boolean - True/False value

Below are the first 10 records of this dummy dataset:

Table 7.1: randomly generated dataset, first 10 records
class size cost tested
R S low False
R S low False
R S high False
R M free False
R M low False
R L free True
R L high False
R M free False
R S free False
R S free False

To produce the probability tables, we must first calculate the prior probabilities for each of the classes in our label column, classes.

From here, we can iterate on the remaining columns with respect to the label column. In our case, we have the columns of cost, size, and tested.

The general process is as follows:

  • select the feature column

  • for each unique value within the feature column, \(X_j\):

    • for each unique class in our labels:

      • get the raw count of the number of records where the feature column is equal to the unique value and the class is equal to the current class. This is a proxy for \(P(X_i|Y)\)

      • calculate the total raw of each class in the resulting filter, analogous to \(P(Y)\)

      • use the above calulations to calculate \(P(Y_i|X)\) as listed in Equation 7.2

      • select the highest value for \(P(Y_i|X)\) to select the class prediction

Using this, we can precompute relative frequencies and probabilities \(P(Y|X_i)\) for each value, given that the record is a member of the target class:

Table 7.2: relative frequency of size, given class
size class count prob
S B 34 0.272000
M B 43 0.344000
L B 48 0.384000
S G 31 0.269565
M G 47 0.408696
L G 37 0.321739
S R 25 0.333333
M R 27 0.360000
L R 23 0.306667
S Y 36 0.423529
M Y 23 0.270588
L Y 26 0.305882
Table 7.3: relative frequency of cost, given class
cost class count prob
low B 24 0.192000
high B 26 0.208000
free B 38 0.304000
medium B 37 0.296000
low G 33 0.286957
high G 22 0.191304
free G 28 0.243478
medium G 32 0.278261
low R 18 0.240000
high R 24 0.320000
free R 19 0.253333
medium R 14 0.186667
low Y 20 0.235294
high Y 19 0.223529
free Y 19 0.223529
medium Y 27 0.317647
Table 7.4: relative frequency of tested, given class
tested class count prob
False B 56 0.448000
True B 69 0.552000
False G 64 0.556522
True G 51 0.443478
False R 43 0.573333
True R 32 0.426667
False Y 46 0.541176
True Y 39 0.458824

These probability tables are pre-computed and used for future classification tasks. We directly treat these as independent probabilities and on introduction of a new record, we simply perform the following tasks:

  • For every possible result classification, \(y_i\) (in our case, G, B, Y, and R):

    • Take the new input record

    • Start with prob = prior probability for each class (the relative frequency of each class \(y_i\) in the source data)

    • Iterate through the pre-computed probability tables for size, cost, and tested:

      • get the value of the probability column where the class is the current possible result classification \(y_i\) and the record feature value is equal to the table feature value.

      • multiply the current value for prob by the selected value. This is analogous to Equation 7.8 for the current class \(y_i\)

    • after completion we have \(P(Y=y_i|X)\)

  • now that we’ve collected all \(P(Y|X)\) for every \(y_i\), one selects the maximum value amongst all \(P(Y=y_i|X)\) and set the predicted class as \(y_i\)

Examining this process for the below example record:

Table 7.5: New Record for Testing the Model
size cost tested
L medium True

To perform the calculation, one can initialize the data with the prior probability \(P(Y)\) for each potential output class.

Table 7.6: Prior probabilites (P(Y))
B G Y R
0.312500 0.287500 0.212500 0.187500

From here, one must examine the values in each table, where the column in Table 7.5 column is equal to the same column in each of our pre-computed tables, Table 7.2,

Table 7.7: Probabilites from Size Frequencies where size=L
B G Y R
0.384000 0.321739 0.305882 0.306667

for Table 7.3,

Table 7.8: Probabilites from Size Frequencies where cost=medium
B G Y R
0.296000 0.278261 0.317647 0.186667

and for Table 7.4

Table 7.9: Probabilites from Tested Frequencies where tested=True
B G Y R
0.552000 0.443478 0.458824 0.426667

Taking each of these (the prior probabilities and the extracted probabilities), one can calculate the probability \(P(Y_i|X))\) for the input record

Table 7.10: Calculation of P(Y|X)
class B G Y R
prob 0.019607 0.011415 0.009473 0.00458

The final result, one can see, is that the classification of the record will be “B”, as it has the highest probability amongst all the potential classes.

7.1.3 The Zero-Frequency Problem

One challenge with the Naive Bayes algorithm is when one or more of the feature spaces has zero occurences within a given output class. When this occurs, it sets the probability for a record being a member of that class to zero, meaning that no newly introduced data or records can ever be classified as a member of that class by the algorithm. Without additional data and training, the model will never adapt to these new inputs, because the calculation is a running product of the relative frequencies. Thus, a single relative frequency of zero will result in an overall probability of zero, and the output class will not be predicted.

What if no records of class Y in the source data were ever tested?

tested class count prob
False B 56 0.448000
True B 69 0.552000
False G 64 0.556522
True G 51 0.443478
False R 43 0.573333
True R 32 0.426667
False Y 85 1.000000
True Y 0 0.000000

One can see that the change of all records where class=Y and tested=True have a probability of zero.

This has a substatial impact on predictions, namely - whenever a record has tested=True, the trained model will produce a zero probability result for the class of a new record belonging to Y, and thus no new records will ever be classified as Y. Here’s the predicted outcome for the record in the previous example:

size cost tested
L medium True

We see the record has tested = True, so the outcome will never be class Y:

R G Y B
0.004580 0.011415 0.000000 0.019607

The absence of a probability is, well, problematic. If the new record truly did belong to class Y, the model can never predict it as belonging to the class, due to the absence of data. There are methods and means of handling this issue, however.

One option includes updating an existing Naive Bayes model in relatively short order with new records, new training set information, and recomputing the prior probabilities for classification of future records. Without such additional data and retraining/updating, however, the classification challenge will remain.

Another option to rectify the zero-frequency issue is via smoothing methods. These methods allow for any possibility to occur, and assign minute, non-zero probabilities to any cases which have zero frequency within the training dataset. In doing this it ensures that, for every value of every considered feature, there is some probability \(p\) strictly greater than zero assigned, thus allowing potential predictions into the appropriate class for any new input tuple.

A common smoothing technique for Multinomial Naive Bayes is Laplace smoothing. The technique adapts the calculation of the probabilites for each feature \(P(X_i|Y)\) in the following manner:

\[ P(X|Y) = \frac{P(Y|X) + \alpha}{P(Y)+\alpha n} \tag{7.10}\]

Where \(\alpha\) is equal to 1, and n is the number of total categories \(Y\) in the dataset. There are The additive non-zero values to the numerator \(\alpha\) and the denominator \(\alpha n\) ensures that the probabilities for all \(P(X_i|Y)\) are greater than zero, thus enabling them to (potentially) be predicted by the model. It’s possible that a single, and potentially less important, feature could be the difference between whether or not the correct class can be predicted absent such smoothing, but the inclusion of Laplace or other smoothing techniques can support better modeling.

Thus far, everything explained above examines Categorical naive bayes. There are several other versions of the Naive Bayes algorithm that operate under the same independence assumption, but algorithmic performance differs from what has been explained thus far.

7.1.4 Bernoulli Naive Bayes

This algorithm performs similarly to the Multinomial Naive Bayes algorithm, but on binary encoded (0/1) data. Every category in each feature needs to be transferred into a column, and then each column is set to a zero if a record does not have the specified category value for that feature, and a one otherwise.

However, the probabilities are calculated differently from that of the Multinomial Naive Bayes Algorithm. For calculating the values of the prior probabilities in the training data, the formula is repaced as follows:

\[ P(x_i|y) = P(x_i=1|y)\cdot x_i + (1-P(x_i=1|y))\cdot(1-x_i) \]

This is similar to the construct of the binomial distribution (chance for k successes in n trials), and is repeated for each feature value \(x_i\) in the dataset. This is of benefit to Bernoulli naive bayes over Multinomial, as it will include and penalize non-occuring combinations of \(x_i\) and \(y\) together in the dataset.

Some advantages of Bernoulli naive bayes are that it is relatively simple (for small datasets) to implement and that it performs well on tasks such as task classification (e.g. detecting spam emails, for instance). However, it is only able to categorize or predict on a binary outcome, similar to logistic regression. This is applicable for this research, but may not fit all use cases.

7.2 Data and Code

To prepare the data, several steps were necessary. Namely, the data for this research effort is of mixed (categorical and quantitative) types. Different data transformation techniques and algorithms were required for application to source data to place it in a usable format for each Naive Bayes algorithm.

For all model code, in addition to building 2 models with and without protected class information, an exploration into statistically significant differences in model performance was performed. The experiment is constructed under the following parameters:

  • For each naive bayes model type, repeat the following steps, 500 times:

    • initialize a random seed

    • sample the records from both datasets (with and without protected class information) using the same random seed so as to pull the same records, on an 80/20 train/test split. This ensures the same subjects are present in the training and testing data for each random sampling.

    • train two models using the training data

      • in the event of an error in fitting or predicting under the current train/test split, decrement the loop counter so that we ensure 500 measurements and restart the loop.
    • predict the outcomes using the testing data

    • capture model performance metrics (accuracy, precision, recall, F1, ROC-AUC) for the two models trained with and without protected class information (age/gender/race).

  • After metrics for 500 models are captured:

    • Construct a paired-t test for difference in means between the two models’ performance metrics, on a per-metric basis.

      • \(H_0\): There is no difference in the mean performance metric for models when trained with and without protected class information

      • \(H_A\): One of the models has a higher mean performance metric when protected class information is included

      • \(\alpha=0.003\) or a \(3\sigma\) confidence

    • Visualize the distribution of performance metrics for each model

    • Conclude on any statistically significant differences between the models’ performance

The output for these tests are located as follows:

7.2.1 Multinomial Naive Bayes

The data prepration and code were executed natively in Appendix G, sourcing from the final clean dataset.

Multinomial Naive Bayes requires count data. Since the native format of the initially cleaned dataset is purely record data, there are challenges to convert this to count information. Namely, each feature and feature value either occurs, or it doesn’t. As such, the data had to be transformed into a one-hot encoded dataset with either 1 if the feature value occurred in the record, and 0 otherwise. This same data is leveraged in Bernoulli Naive Bayes for this portion of the research.

Table 7.11: Initial Data Used
state_code county_code derived_sex action_taken purchaser_type preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate ... tract_median_age_of_housing_units applicant_race co-applicant_race applicant_ethnicity co-applicant_ethnicity aus denial_reason outcome company income_from_median
0 OH 39153.0 Sex Not Available 1 0 2 2 665000.0 85.000 4.250 ... 36 32768 131072 32 128 64 512 1.0 JP Morgan True
1 NY 36061.0 Male 1 0 2 2 755000.0 21.429 4.250 ... 0 32768 262144 64 256 64 512 1.0 JP Morgan False
2 NY 36061.0 Sex Not Available 1 0 1 2 965000.0 80.000 5.250 ... 0 65536 262144 64 256 64 512 1.0 JP Morgan False
3 FL 12011.0 Male 1 0 2 2 705000.0 92.175 5.125 ... 12 32768 262144 32 256 64 512 1.0 JP Morgan False
4 MD 24031.0 Joint 1 0 2 2 1005000.0 65.574 5.625 ... 69 66 32768 32 32 64 512 1.0 JP Morgan False
5 NC 37089.0 Joint 1 0 1 2 695000.0 85.000 6.000 ... 39 32768 32768 32 32 64 512 1.0 JP Morgan False
6 CA 6073.0 Joint 2 0 2 2 905000.0 75.000 6.250 ... 44 2 2 32 32 64 512 1.0 JP Morgan False
7 NY 36061.0 Sex Not Available 2 0 1 2 355000.0 15.909 5.625 ... 63 65536 65536 64 64 64 512 1.0 JP Morgan False
8 NY 36061.0 Joint 1 0 1 2 1085000.0 90.000 5.625 ... 75 32768 32768 32 32 64 512 1.0 JP Morgan True
9 MO 29189.0 Sex Not Available 2 0 1 2 405000.0 53.333 5.750 ... 0 65536 65536 64 64 64 512 1.0 JP Morgan True

10 rows × 51 columns

Table 7.12: MultinomialNB Training Data (With protected classes)
derived_sex_Female derived_sex_Joint derived_sex_Male derived_sex_Sex Not Available purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
149746 1 0 0 0 0 1 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
105015 0 0 1 0 1 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
29094 0 1 0 0 0 1 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1.0
101082 0 0 1 0 1 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
77750 1 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 1 0 0 1.0
197336 0 1 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0.0
137650 0 0 0 1 0 1 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 1.0
136772 1 0 0 0 0 1 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
41734 1 0 0 0 1 0 0 0 0 0 ... 1 0 1 0 0 0 1 0 0 0.0
10710 1 0 0 0 0 0 1 0 0 0 ... 1 0 1 0 0 1 0 0 0 1.0

10 rows × 244 columns

Table 7.13: MultinomialNB Testing Data (With protected classes)
derived_sex_Female derived_sex_Joint derived_sex_Male derived_sex_Sex Not Available purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
30860 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
126890 0 0 0 1 0 0 1 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
28730 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
31244 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
56105 0 0 1 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 1.0
5443 0 0 1 0 0 0 1 0 0 0 ... 1 0 1 0 0 0 0 0 0 1.0
95702 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1.0
83812 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0.0
84338 0 1 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 1.0
124545 0 1 0 0 0 0 1 0 0 0 ... 0 1 0 0 0 0 0 0 0 1.0

10 rows × 244 columns

Table 7.14: MultinomialNB Training Data (No protected classes)
purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 purchaser_type_71 preapproval_1 preapproval_2 open-end_line_of_credit_1 ... tract_median_age_of_housing_units_H tract_median_age_of_housing_units_M tract_median_age_of_housing_units_MH tract_median_age_of_housing_units_ML company_Bank of America company_JP Morgan company_Navy Federal Credit Union company_Rocket Mortgage company_Wells Fargo outcome
149746 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
105015 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
29094 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
101082 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
77750 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
197336 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
137650 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0
136772 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
41734 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
10710 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0

10 rows × 176 columns

Table 7.15: MultinomialNB Training Data (No protected classes)
purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 purchaser_type_71 preapproval_1 preapproval_2 open-end_line_of_credit_1 ... tract_median_age_of_housing_units_H tract_median_age_of_housing_units_M tract_median_age_of_housing_units_MH tract_median_age_of_housing_units_ML company_Bank of America company_JP Morgan company_Navy Federal Credit Union company_Rocket Mortgage company_Wells Fargo outcome
30860 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
126890 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
28730 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
31244 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
56105 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
5443 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
95702 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
83812 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
84338 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
124545 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0

10 rows × 176 columns

Also of note: the record indexes in Table 9.6 and Table 9.8 match one another, meaning that the two models are trained on the same subjects. Similarly, the indexes between Table 9.7 and Table 9.9 match, so they are tested on the same subjects between the two models.

Also note that the indexes between Table 9.6 and Table 9.7 are disjoint - that means that the model has disjoint training and testing data. Similarly, Table 9.8 and Table 9.9 are disjoint.

By achieving these splits, the two models evaluated with and without protected class information will avoid unnecessary biases in the results. When a model is tested on data with which it has already been trained, the model has already optimized to the best of its ability to correctly classify the training data. As such, the outcome of an evaluation of a model using the same data in training and testing will artificially inflate its performance metrics (accuracy, precision, recall, F1, ROC-AUC). As such, it is paramount to have a disjoint training and testing dataset.

7.2.2 Bernoulli Naive Bayes

The data preparation was executed in Appendix G, sourced from the final clean dataset. To prepare the data for two different models, 2 copies were made. The second copy dropped all columns that included protected class information, and the first retained all source columns.

To perform an MCA, the data must be in a one-hot-encoded (or binary 0/1) format per category and feature. Since this work was previously done, the steps were seamless to include for this purpose.

Examples of the two dataset samples used for Bernoulli Naive Bayes are summarized in Table 9.6, Table 9.7, Table 9.8, and Table 9.9.

7.2.3 Categorical Naive Bayes

The data for categorical naive bayes was sourced from the final clean dataset.

In Appendix D, the data is transformed into label-encoded format to perform the categorical naive bayes analysis. In some cases, binary data was captured (mostly in terms of protected class information). These binary caputred data was parsed back out into featurename:featurevalue columns with a 1 if the record met the condition, and zero otherwise. This format still meets the needs of categorical naive bayes.

Table 7.16: CategoricalNB Training Data (With protected classes)
derived_sex preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate total_loan_costs origination_charges discount_points lender_credits ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
149746 0 1 1 1 2 2 1 1 1 1 ... 1 1 0 0 0 0 0 0 0 1.0
105015 2 1 1 1 2 2 1 1 1 1 ... 1 1 0 0 0 0 0 0 0 1.0
29094 1 1 1 1 2 2 1 1 1 1 ... 0 1 1 0 0 1 0 0 0 1.0
101082 2 1 1 1 2 2 0 0 0 1 ... 1 1 0 0 0 0 0 0 0 1.0
77750 0 1 1 2 1 2 1 1 1 1 ... 1 0 0 0 0 0 1 0 0 1.0
197336 1 1 1 1 2 2 1 1 1 1 ... 0 0 1 0 0 0 0 0 0 0.0
137650 3 1 1 2 2 0 3 1 1 1 ... 0 1 0 0 0 0 0 0 0 1.0
136772 0 1 1 1 4 2 1 2 1 1 ... 1 1 0 0 0 0 0 0 0 1.0
41734 0 1 1 1 1 2 1 1 1 1 ... 1 0 1 0 0 0 1 0 0 0.0
10710 0 1 1 1 2 2 1 1 1 1 ... 1 0 1 0 0 1 0 0 0 1.0

10 rows × 101 columns

Table 7.17: CategoricalNB Training Data (With protected classes)
derived_sex preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate total_loan_costs origination_charges discount_points lender_credits ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
30860 2 1 1 2 2 2 3 1 1 1 ... 1 1 1 0 0 1 0 0 0 1.0
126890 3 1 1 3 2 3 1 1 1 1 ... 1 1 0 0 0 0 0 0 0 1.0
28730 2 1 1 1 2 3 3 1 1 1 ... 1 1 1 0 0 1 0 0 0 1.0
31244 2 1 1 2 2 2 0 0 0 1 ... 1 1 1 0 0 1 0 0 0 1.0
56105 2 1 1 3 1 2 1 1 1 1 ... 0 0 1 0 0 0 0 0 0 1.0
5443 2 1 1 3 2 2 1 1 1 1 ... 1 0 1 0 0 0 0 0 0 1.0
95702 1 1 1 1 3 0 1 1 1 1 ... 0 0 0 0 0 0 0 1 0 1.0
83812 1 1 1 2 2 2 1 1 1 1 ... 0 0 0 0 0 0 1 0 0 0.0
84338 1 1 1 1 1 2 3 1 3 1 ... 0 0 1 0 0 0 0 0 0 1.0
124545 1 1 1 1 2 2 1 1 1 1 ... 0 1 0 0 0 0 0 0 0 1.0

10 rows × 101 columns

Table 7.18: CategoricalNB Training Data (No protected classes)
preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate total_loan_costs origination_charges discount_points lender_credits loan_term ... company aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
149746 1 1 1 2 2 1 1 1 1 2 ... 3 1 0 0 0 0 0 0 0 1.0
105015 1 1 1 2 2 1 1 1 1 2 ... 2 1 0 0 0 0 0 0 0 1.0
29094 1 1 1 2 2 1 1 1 1 2 ... 1 1 1 0 0 1 0 0 0 1.0
101082 1 1 1 2 2 0 0 0 1 2 ... 2 1 0 0 0 0 0 0 0 1.0
77750 1 1 2 1 2 1 1 1 1 2 ... 4 0 0 0 0 0 1 0 0 1.0
197336 1 1 1 2 2 1 1 1 1 2 ... 3 0 1 0 0 0 0 0 0 0.0
137650 1 1 2 2 0 3 1 1 1 2 ... 3 1 0 0 0 0 0 0 0 1.0
136772 1 1 1 4 2 1 2 1 1 2 ... 3 1 0 0 0 0 0 0 0 1.0
41734 1 1 1 1 2 1 1 1 1 2 ... 0 0 1 0 0 0 1 0 0 0.0
10710 1 1 1 2 2 1 1 1 1 2 ... 1 0 1 0 0 1 0 0 0 1.0

10 rows × 34 columns

Table 7.19: CategoricalNB Training Data (No protected classes)
preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate total_loan_costs origination_charges discount_points lender_credits loan_term ... company aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
30860 1 1 2 2 2 3 1 1 1 2 ... 1 1 1 0 0 1 0 0 0 1.0
126890 1 1 3 2 3 1 1 1 1 2 ... 3 1 0 0 0 0 0 0 0 1.0
28730 1 1 1 2 3 3 1 1 1 2 ... 1 1 1 0 0 1 0 0 0 1.0
31244 1 1 2 2 2 0 0 0 1 2 ... 1 1 1 0 0 1 0 0 0 1.0
56105 1 1 3 1 2 1 1 1 1 1 ... 0 0 1 0 0 0 0 0 0 1.0
5443 1 1 3 2 2 1 1 1 1 2 ... 1 0 1 0 0 0 0 0 0 1.0
95702 1 1 1 3 0 1 1 1 1 2 ... 2 0 0 0 0 0 0 1 0 1.0
83812 1 1 2 2 2 1 1 1 1 2 ... 4 0 0 0 0 0 1 0 0 0.0
84338 1 1 1 1 2 3 1 3 1 2 ... 4 0 1 0 0 0 0 0 0 1.0
124545 1 1 1 2 2 1 1 1 1 2 ... 3 1 0 0 0 0 0 0 0 1.0

10 rows × 34 columns

Just as for MultinomialNB, the data for CategoricalNB has the similar combination of same indexes (between datasets) and disjoint indexes (between train test splits).

7.3 Results

7.3.1 Categorical Naive Bayes

Figure 7.1: Confusion Matrices (CategoricalNB, Single-Run)
Table 7.20: Model Performance Metrics (CategoricalNB, Single Run)
Model Data Accuracy Precision Recall F1 ROC-AUC
CategoricalNB With Protected Classes 0.882012 0.943416 0.917050 0.930046 0.795993
CategoricalNB Without Protected Classes 0.912455 0.934819 0.964922 0.949632 0.783651
Figure 7.2: Metric Kernel Density Estimates (500 randomizations, CategoricalNB)
Table 7.21: Statistical Significance Tests (Model Performance Metrics, CategoricalNB)
Model Stat z-score p-value top performer top mean difference in means
CategoricalNB Accuracy -8.271750 0.000000 Without Protected Classes 0.883466 0.031069
CategoricalNB Precision 2.768951 0.005624 With Protected Classes 0.924105 0.006065
CategoricalNB Recall -21.365216 0.000000 Without Protected Classes 0.948421 0.046966
CategoricalNB F1 -9.276604 0.000000 Without Protected Classes 0.932983 0.020344
CategoricalNB ROC-AUC 1.049461 0.293966 With Protected Classes 0.731962 0.007958

7.3.2 Multinomial Naive Bayes

Figure 7.3: Confusion Matrices (MultinomialNB, Single-Run)
Table 7.22: Model Performance Metrics (MultinomialNB, Single Run)
Model Data Accuracy Precision Recall F1 ROC-AUC
MultinomialNB With Protected Classes 0.936358 0.967880 0.957361 0.962591 0.884798
MultinomialNB Without Protected Classes 0.939653 0.969309 0.959833 0.964548 0.890112
Figure 7.4: Metric Kernel Density Estimates (500 randomizations, MultinomialNB)
Table 7.23: Statistical Significance Tests (Model Performance Metrics, MultinomialNB)
Model Stat z-score p-value top performer top mean difference in means
MultinomialNB Accuracy -1.242399 0.214090 Without Protected Classes 0.804602 0.000175
MultinomialNB Precision 12.735062 0.000000 With Protected Classes 0.914013 0.000778
MultinomialNB Recall -7.701771 0.000000 Without Protected Classes 0.852538 0.001103
MultinomialNB F1 -2.489032 0.012809 Without Protected Classes 0.881842 0.000228
MultinomialNB ROC-AUC 10.578913 0.000000 With Protected Classes 0.689026 0.002104

7.3.3 Bernoulli Naive Bayes

Figure 7.5: Confusion Matrices (BernoulliNB, Single-Run)
Table 7.24: Model Performance Metrics (BernoulliNB, Single Run)
Model Data Accuracy Precision Recall F1 ROC-AUC
BernoulliNB With Protected Classes 0.937514 0.973560 0.952818 0.963077 0.899943
BernoulliNB Without Protected Classes 0.939875 0.977551 0.951553 0.964377 0.911205
Figure 7.6: Metric Kernel Density Estimates (500 randomizations, BernoulliNB)
Table 7.25: Statistical Significance Tests (Model Performance Metrics, BernoulliNB)
Model Stat z-score p-value top performer top mean difference in means
BernoulliNB Accuracy -53.447073 0.000000 Without Protected Classes 0.940754 0.003477
BernoulliNB Precision -44.319309 0.000000 Without Protected Classes 0.969658 0.002109
BernoulliNB Recall -31.047924 0.000000 Without Protected Classes 0.960794 0.001971
BernoulliNB F1 -52.811800 0.000000 Without Protected Classes 0.965205 0.002040
BernoulliNB ROC-AUC -49.602472 0.000000 Without Protected Classes 0.891554 0.007172

7.3.4 Overall

Table 7.26: Summary of single-run model performance outcomes
Model Data Accuracy Precision Recall F1 ROC-AUC
CategoricalNB With Protected Classes 0.882012 0.943416 0.917050 0.930046 0.795993
CategoricalNB Without Protected Classes 0.912455 0.934819 0.964922 0.949632 0.783651
MultinomialNB With Protected Classes 0.936358 0.967880 0.957361 0.962591 0.884798
MultinomialNB Without Protected Classes 0.939653 0.969309 0.959833 0.964548 0.890112
BernoulliNB With Protected Classes 0.937514 0.973560 0.952818 0.963077 0.899943
BernoulliNB Without Protected Classes 0.939875 0.977551 0.951553 0.964377 0.911205
Table 7.27: Summary of 500 Randomization Test Outcomes (All Model Types)
Model Stat z-score p-value top performer top mean difference in means
BernoulliNB Precision -44.319309 0.000000 Without Protected Classes 0.969658 0.002109
BernoulliNB F1 -52.811800 0.000000 Without Protected Classes 0.965205 0.002040
BernoulliNB Recall -31.047924 0.000000 Without Protected Classes 0.960794 0.001971
CategoricalNB Recall -21.365216 0.000000 Without Protected Classes 0.948421 0.046966
BernoulliNB Accuracy -53.447073 0.000000 Without Protected Classes 0.940754 0.003477
CategoricalNB F1 -9.276604 0.000000 Without Protected Classes 0.932983 0.020344
CategoricalNB Precision 2.768951 0.005624 With Protected Classes 0.924105 0.006065
MultinomialNB Precision 12.735062 0.000000 With Protected Classes 0.914013 0.000778
BernoulliNB ROC-AUC -49.602472 0.000000 Without Protected Classes 0.891554 0.007172
CategoricalNB Accuracy -8.271750 0.000000 Without Protected Classes 0.883466 0.031069
MultinomialNB F1 -2.489032 0.012809 Without Protected Classes 0.881842 0.000228
MultinomialNB Recall -7.701771 0.000000 Without Protected Classes 0.852538 0.001103
MultinomialNB Accuracy -1.242399 0.214090 Without Protected Classes 0.804602 0.000175
CategoricalNB ROC-AUC 1.049461 0.293966 With Protected Classes 0.731962 0.007958
MultinomialNB ROC-AUC 10.578913 0.000000 With Protected Classes 0.689026 0.002104

Examining the above figures and tables for each model type, performance metrics, and the distribution of performance metrics across 500 random trials, several findings are evident:

  • Bernoulli Naive Bayes is the best performing across all performance metrics at 89.1-96.5% as the mean metric score. In Table 7.27, the performance is sorted by the top mean metric in descending order. For all stats, BernoulliNB outperforms Categorical and Multinomial naive bayes models.

  • There are statistically significant differences in model performance for nearly all performance metrics, across all naive bayes model types (exceptions: CategoricalNB Precision, CategoricalNB ROC-AUC)

  • The difference in mean performance metrics, where significant, is less than 4% across all hypothesis tests.

These findings are revealing. First and foremost, if a naive bayes model were to be employed, it should be the Bernoulli format, regardless of what data is included. This can be of use in future modeling efforts. It effectively examines whether or not a record meets a specific set of conditions that support a conclusion of approval or denial of a loan.

Furthermore, in the case of Bernoulli naive bayes, there is a statistically significant outcome in favor of the exclusion of protected class information to deliver higher model performance (with metrics ranging from 0.07%-1% higher).

Were a financial institution to instead employ CategoricalNB or MultinomialNB, they would be doing themselves a disservice in terms of model performance. However, they do have access to additional variables and information not available to the general public, and may find that either of these models serve their predictive needs better than BernoulliNB.

That being the case - for each of these models, there are cases with statistically significant outcomes where the models perform better when protected class information is included in the training and testing data:

Model Stat z-score p-value top performer top mean difference in means
1 CategoricalNB Precision 2.768951 5.623710e-03 With Protected Classes 0.924105 0.006065
4 CategoricalNB ROC-AUC 1.049461 2.939661e-01 With Protected Classes 0.731962 0.007958
1 MultinomialNB Precision 12.735062 3.775460e-37 With Protected Classes 0.914013 0.000778
4 MultinomialNB ROC-AUC 10.578913 3.732618e-26 With Protected Classes 0.689026 0.002104

When the better performing model includes protected class information, one can clearly see that, while statistically significant, the difference in means between the models is less than 1%. With such a slight difference, performance gain by incorporating these features is not justified, and from an operational standpoint, the features can be excluded from models with minimal impact to effective predictions.