7 Naive Bayes

7.1 Overview

The Naive Bayes method of classification provides a fairly simple and accessible framework under which to estimate the probability that a subject is a member of a certain class, given prior evidence within one’s dataset. The methodology is commonly used in recommendation systems (e.g. streaming services, online gaming services, e-commerce, and many more) to quickly identify and suggest new actions, items, or activity to a user based on their past actions or activity.
The simplicity of the algorithm makes it an excellent candidate for assessing categorical data (in the case of multinomial Naive Bayes), as well as normally-distributed numerical data (for Gaussian Naive Bayes), and Bernoulli distributed or binary data (as for Bernoulli Naive Bayes). The assumptions made in the algorithm render it simple to implement, and to produce models with a reliable degree of performance.

7.1.1 What makes it Naive?

The following equations outline different formulations of Bayes’ Rule. In these formulations, let X represent the Data, and Y represent the Outcome or Label for the data.

\[ P(X|Y) = \frac{P(Y|X)\cdot P(X)}{P(Y)} \tag{7.1}\]

\[ P(X|Y) = \frac{P(X\cap Y)}{P(Y)} \tag{7.2}\]

\[ P(Y|X) = \frac{P(X|Y)\cdot P(Y)}{P(X)} \tag{7.3}\]

\[ P(Y|X) = \frac{P(X\cap Y)}{P(X)} \tag{7.4}\]

Bayes’ rule has no stipulation that X and Y be independent from one another in the source data. The requirement is that the input probability for \(P(X)\), \(P(Y)\), and \(P(X\cap Y)\) be the actual probability for each variable or combination thereof.
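As a quick numeric check of Equations 7.1, 7.2, and 7.4, the sketch below uses counts that appear later in this chapter (Table 7.4), taking X to be "tested is True" and Y to be "class is B". The variable names are illustrative only.

```python
# Counts taken from Table 7.4 (400 records in total).
n_total = 400
n_y = 125                  # records with class == "B" (56 + 69)
n_x = 69 + 51 + 32 + 39    # records with tested == True, summed over all classes
n_xy = 69                  # records with class == "B" and tested == True

p_x = n_x / n_total
p_y = n_y / n_total
p_xy = n_xy / n_total

p_x_given_y = p_xy / p_y   # Equation 7.2
p_y_given_x = p_xy / p_x   # Equation 7.4

# Equation 7.1 recovers P(X|Y) from P(Y|X), P(X), and P(Y).
p_x_given_y_via_bayes = p_y_given_x * p_x / p_y

print(round(p_x_given_y, 3), round(p_y_given_x, 3))
```

Both routes to \(P(X|Y)\) agree, and nothing in the calculation requires X and Y to be independent.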

The algorithm is naive because it assumes that all features (input variables) are independent of one another given the class, so that the value of one feature does not affect the probability of another. Under this assumption, one can leverage the probability rule for independence between two variables:

\[ P(X\cap Y) = P(X)\cdot P(Y)\iff \text{X,Y are independent} \tag{7.5}\]

In the world of statistics, it can be quite rare to encounter a collected dataset in which all features are truly independent of one another.

Novice statistics students sometimes assume that variables are independent while working out problems in their homework, quizzes, or exams. This leads to incorrect responses, but makes the problem doable in relatively short order. In doing so, those students demonstrate their naivety. Since it makes the same assumption and performs no work to confirm or refute the claim, Naive Bayes is similarly naive.

This allows for the transformation of Equation 7.3, Equation 7.4, and Equation 7.5 to calculate probabilities when fitting data to a model and predicting the class of a new record:

\[ P(Y|X) = \frac{P(Y)\cdot P(X_1|Y)\cdot P(X_2|Y)\cdot ...\cdot P(X_n|Y)}{P(X)} \tag{7.6}\]

\[ P(Y|X) = \frac{P(Y)\cdot \frac{P(X_1\cap Y)}{P(Y)}\cdot \frac{P(X_2 \cap Y)}{P(Y)}\cdot ...\cdot \frac{P(X_n\cap Y)}{P(Y)}}{P(X)} \tag{7.7}\]

\[ P(Y|X) = \frac{P(Y)\cdot\prod\limits_{i=1}^n P(X_i|Y)}{P(X)} \tag{7.8}\]

In Equation 7.8, one can see that the probability of belonging to a certain class, given an input record, is proportional to the prior probability of the class \(P(Y)\) multiplied by the product of the conditional probabilities of each observed feature value \(X_i\) given the class \(Y\). The denominator \(P(X)\) is the same for every class, so it does not affect which class scores highest.

Because of the product formulation in Equation 7.8 above, it can be beneficial to represent the equation using logarithms, which turn the product into a sum that is simpler to compute and less prone to numerical underflow:

\[ \text{log}(P(Y|X)) = \text{log}(P(Y)) + \sum\limits_{i=1}^n \text{log}(P(X_i|Y)) - \text{log}(P(X)) \tag{7.9}\]
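The practical benefit of working in log space can be seen numerically. The sketch below assumes a hypothetical record with 400 features, each contributing a conditional probability of 0.1, and a class prior of 0.25. None of these numbers come from the dataset used in this chapter.

```python
import math

# Hypothetical setup: 400 features, each with conditional probability 0.1.
likelihoods = [0.1] * 400
prior = 0.25

# Multiplying directly underflows to exactly 0.0 in double precision.
direct = prior
for p in likelihoods:
    direct *= p

# Summing logarithms keeps the score finite, so classes remain comparable.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(direct, log_score)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so the prediction is unchanged.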

7.1.2 How does it work?

In short, conditional probabilities are calculated as follows from a training dataset:

\[ P(X|Y) = \frac{P(X\cap Y)}{P(Y)} \]

The above is calculated for all features \(X_i\) in \(X\).

From there, these probabilities are applied in Equation 7.8 for every possible class outcome \(y_i\). The class with the maximum resulting probability is selected as the prediction.

A simple way to understand Naive Bayes is as a calculation of the combined relative frequency for all features or variables in consideration when records belong to a specific class. These relative frequencies are treated as overall probabilities, and are pre-calculated for a Naive Bayes model using a training dataset. When new records are introduced to the model for classification, Naive Bayes outputs the class that has the greatest probability using the pre-computed probabilities in conjunction with the new records’ feature data.

Consider the below example:

For this example, a dummy dataset is generated with 400 records. There are 4 classifications - R, G, Y, and B. There are 3 features for evaluation:

size, which takes on possible values of “S”, “M”, or “L”
cost, which takes on possible values of “free”, “low”, “medium”, or “high”
tested, which is a boolean (True/False) value

Below are the first 10 records of this dummy dataset:

Table 7.1: randomly generated dataset, first 10 records

class	size	cost	tested
R	S	low	False
R	S	low	False
R	S	high	False
R	M	free	False
R	M	low	False
R	L	free	True
R	L	high	False
R	M	free	False
R	S	free	False
R	S	free	False

To produce the probability tables, we must first calculate the prior probabilities for each of the classes in our label column, class.

From here, we can iterate on the remaining columns with respect to the label column. In our case, we have the columns of cost, size, and tested.

The general process is as follows:

select the feature column
for each unique value within the feature column, \(X_i\):
- for each unique class in our labels, \(Y\):
  - get the raw count of the number of records where the feature column is equal to the unique value and the class is equal to the current class. This is a proxy for \(P(X_i\cap Y)\)
  - calculate the total raw count of records in the current class, analogous to \(P(Y)\)
  - use the above calculations to calculate \(P(X_i|Y)\) as listed in Equation 7.2
  - at prediction time, these values are combined into \(P(Y_i|X)\), and the highest value selects the class prediction

Using this, we can precompute relative frequencies and probabilities \(P(X_i|Y)\) for each value, given that the record is a member of the target class:

Table 7.2: relative frequency of size, given class

size	class	count	prob
S	B	34	0.272000
M	B	43	0.344000
L	B	48	0.384000
S	G	31	0.269565
M	G	47	0.408696
L	G	37	0.321739
S	R	25	0.333333
M	R	27	0.360000
L	R	23	0.306667
S	Y	36	0.423529
M	Y	23	0.270588
L	Y	26	0.305882

Table 7.3: relative frequency of cost, given class

cost	class	count	prob
low	B	24	0.192000
high	B	26	0.208000
free	B	38	0.304000
medium	B	37	0.296000
low	G	33	0.286957
high	G	22	0.191304
free	G	28	0.243478
medium	G	32	0.278261
low	R	18	0.240000
high	R	24	0.320000
free	R	19	0.253333
medium	R	14	0.186667
low	Y	20	0.235294
high	Y	19	0.223529
free	Y	19	0.223529
medium	Y	27	0.317647

Table 7.4: relative frequency of tested, given class

tested	class	count	prob
False	B	56	0.448000
True	B	69	0.552000
False	G	64	0.556522
True	G	51	0.443478
False	R	43	0.573333
True	R	32	0.426667
False	Y	46	0.541176
True	Y	39	0.458824
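The counting procedure behind Tables 7.2 through 7.4 can be sketched in plain Python. This is a minimal illustration, not the code used to produce the tables: the function name `conditional_table` is hypothetical, and the input is only the size column from the first 10 rows shown in Table 7.1 (all of class R), so the resulting probabilities differ from the full-dataset values in Table 7.2.

```python
from collections import Counter

def conditional_table(records, feature, label="class"):
    """Return {(feature_value, class_value): P(feature_value | class_value)}."""
    class_counts = Counter(r[label] for r in records)
    joint_counts = Counter((r[feature], r[label]) for r in records)
    # Combinations that never occur are simply absent from the result,
    # which is equivalent to a relative frequency of zero.
    return {
        (value, cls): count / class_counts[cls]
        for (value, cls), count in joint_counts.items()
    }

# Size values of the first 10 rows of Table 7.1, in order; all are class "R".
records = [{"class": "R", "size": s} for s in "SSSMMLLMSS"]
size_given_class = conditional_table(records, "size")

for key in sorted(size_given_class):
    print(key, size_given_class[key])
```

Within each class the probabilities for a feature sum to one, which is a useful sanity check on any table built this way.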

These probability tables are pre-computed and used for future classification tasks. We directly treat these as independent probabilities and on introduction of a new record, we simply perform the following tasks:

For every possible result classification, \(y_i\) (in our case, G, B, Y, and R):
- Take the new input record
- Start with prob = prior probability for the class (the relative frequency of class \(y_i\) in the source data)
- Iterate through the pre-computed probability tables for size, cost, and tested:
  - get the value of the probability column where the class is the current possible result classification \(y_i\) and the record feature value is equal to the table feature value.
  - multiply the current value for prob by the selected value. This is analogous to Equation 7.8 for the current class \(y_i\)
- after completion we have a value proportional to \(P(Y=y_i|X)\); the shared denominator \(P(X)\) is omitted because it does not change the ranking of the classes
Having collected this value for every \(y_i\), one selects the maximum amongst them and sets the predicted class to the corresponding \(y_i\)

Examining this process for the below example record:

Table 7.5: New Record for Testing the Model

size	cost	tested
L	medium	True

To perform the calculation, one can initialize the data with the prior probability \(P(Y)\) for each potential output class.

Table 7.6: Prior probabilities (P(Y))

B	G	Y	R
0.312500	0.287500	0.212500	0.187500

From here, one must look up, in each of the pre-computed tables, the rows whose feature value equals the corresponding value in Table 7.5. For Table 7.2,

Table 7.7: Probabilities from Size Frequencies where size=L

B	G	Y	R
0.384000	0.321739	0.305882	0.306667

for Table 7.3,

Table 7.8: Probabilities from Cost Frequencies where cost=medium

B	G	Y	R
0.296000	0.278261	0.317647	0.186667

and for Table 7.4

Table 7.9: Probabilities from Tested Frequencies where tested=True

B	G	Y	R
0.552000	0.443478	0.458824	0.426667

Taking each of these (the prior probabilities and the extracted probabilities), one can calculate the probability \(P(Y_i|X)\) for the input record by multiplying them together for each class:

Table 7.10: Calculation of P(Y|X)

class	B	G	Y	R
prob	0.019607	0.011415	0.009473	0.00458

The final result, one can see, is that the classification of the record will be “B”, as it has the highest probability amongst all the potential classes.
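The arithmetic behind Table 7.10 can be reproduced directly from Tables 7.6 through 7.9. The sketch below hard-codes those published values; the variable names are illustrative. It also normalizes the scores, an extra step not shown in the tables, to give posteriors that sum to one.

```python
# Values copied from Tables 7.6 through 7.9.
priors      = {"B": 0.312500, "G": 0.287500, "Y": 0.212500, "R": 0.187500}
size_L      = {"B": 0.384000, "G": 0.321739, "Y": 0.305882, "R": 0.306667}
cost_medium = {"B": 0.296000, "G": 0.278261, "Y": 0.317647, "R": 0.186667}
tested_true = {"B": 0.552000, "G": 0.443478, "Y": 0.458824, "R": 0.426667}

# Prior times the conditional probability of each observed feature value.
scores = {
    cls: priors[cls] * size_L[cls] * cost_medium[cls] * tested_true[cls]
    for cls in priors
}

# The predicted class is the one with the largest score.
predicted = max(scores, key=scores.get)

# Optional: divide by the sum of the scores so the values sum to one.
total = sum(scores.values())
posteriors = {cls: score / total for cls, score in scores.items()}

print(predicted, scores)
```

Normalizing changes the magnitudes but not the ranking, so the predicted class is the same either way.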

7.1.3 The Zero-Frequency Problem

One challenge with the Naive Bayes algorithm arises when one or more feature values have zero occurrences within a given output class. When this occurs, the probability of a record with that feature value being a member of that class is set to zero, meaning that no newly introduced record carrying that value can ever be classified as a member of that class by the algorithm. Without additional data and training, the model will never adapt to these new inputs, because the calculation is a running product of the relative frequencies. Thus, a single relative frequency of zero will result in an overall probability of zero, and the output class will not be predicted.

What if no records of class Y in the source data were ever tested?

tested	class	count	prob
False	B	56	0.448000
True	B	69	0.552000
False	G	64	0.556522
True	G	51	0.443478
False	R	43	0.573333
True	R	32	0.426667
False	Y	85	1.000000
True	Y	0	0.000000

One can see that the combination class=Y and tested=True now has a probability of zero.

This has a substantial impact on predictions: whenever a record has tested=True, the trained model will produce a zero probability for that record belonging to Y, and thus no such record will ever be classified as Y. Here’s the predicted outcome for the record in the previous example:

size	cost	tested
L	medium	True

We see the record has tested = True, so the outcome will never be class Y:

R	G	Y	B
0.004580	0.011415	0.000000	0.019607

The absence of a probability is, well, problematic. If the new record truly did belong to class Y, the model can never predict it as belonging to the class, due to the absence of data. There are methods and means of handling this issue, however.

One option includes updating an existing Naive Bayes model in relatively short order with new records, new training set information, and recomputing the prior probabilities for classification of future records. Without such additional data and retraining/updating, however, the classification challenge will remain.

Another option to rectify the zero-frequency issue is via smoothing methods. These methods allow for any possibility to occur, and assign minute, non-zero probabilities to any cases which have zero frequency within the training dataset. In doing this it ensures that, for every value of every considered feature, there is some probability \(p\) strictly greater than zero assigned, thus allowing potential predictions into the appropriate class for any new input tuple.

A common smoothing technique for Multinomial Naive Bayes is Laplace smoothing. The technique adapts the calculation of the probabilities for each feature \(P(X_i|Y)\) in the following manner:

\[ P(X_i=x|Y=y) = \frac{N(X_i=x\cap Y=y) + \alpha}{N(Y=y)+\alpha n} \tag{7.10}\]

Where \(N(\cdot)\) is the number of training records meeting the condition, \(\alpha\) is equal to 1, and \(n\) is the number of distinct values that the feature \(X_i\) can take. The additive non-zero values in the numerator (\(\alpha\)) and the denominator (\(\alpha n\)) ensure that the probabilities for all \(P(X_i|Y)\) are greater than zero while still summing to one across the values of the feature, thus enabling every class to (potentially) be predicted by the model. It’s possible that a single, and potentially less important, feature could be the difference between whether or not the correct class can be predicted absent such smoothing, but the inclusion of Laplace or other smoothing techniques can support better modeling.
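A minimal sketch of additive smoothing, assuming the count-based form in which \(\alpha\) is added to each joint count and \(\alpha\) times the number of distinct feature values is added to the class count. It is applied to the zero-frequency example above, where class Y has 85 records with tested=False and none with tested=True. The function name is hypothetical.

```python
def smoothed_probability(joint_count, class_count, n_values, alpha=1.0):
    """Additively smoothed estimate of P(feature value | class).

    joint_count: records in the class that have the feature value
    class_count: all records in the class
    n_values:    number of distinct values the feature can take
    """
    return (joint_count + alpha) / (class_count + alpha * n_values)

# Class Y from the zero-frequency table: 85 False, 0 True, two possible values.
p_true_given_y = smoothed_probability(0, 85, n_values=2)
p_false_given_y = smoothed_probability(85, 85, n_values=2)

# With alpha = 0 the estimate falls back to the raw relative frequency.
p_true_unsmoothed = smoothed_probability(0, 85, n_values=2, alpha=0.0)

print(p_true_given_y, p_false_given_y, p_true_unsmoothed)
```

The smoothed estimate for tested=True is small but non-zero (1/87), so a record with tested=True is no longer automatically excluded from class Y.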

Thus far, everything explained above examines Categorical naive bayes. There are several other versions of the Naive Bayes algorithm that operate under the same independence assumption, but algorithmic performance differs from what has been explained thus far.

7.1.4 Bernoulli Naive Bayes

This algorithm performs similarly to the Multinomial Naive Bayes algorithm, but on binary encoded (0/1) data. Every category in each feature needs to be transformed into its own column, and each column is then set to one if a record has the specified category value for that feature, and zero otherwise.

However, the probabilities are calculated differently from those of the Multinomial Naive Bayes algorithm. For calculating the values of the conditional probabilities in the training data, the formula is replaced as follows:

\[ P(x_i|y) = P(x_i=1|y)\cdot x_i + (1-P(x_i=1|y))\cdot(1-x_i) \]

This is the probability mass function of the Bernoulli distribution (the single-trial case of the binomial distribution), and is evaluated for each feature value \(x_i\) in the dataset. This is of benefit to Bernoulli naive bayes over Multinomial, as it explicitly penalizes the absence of a feature: when \(x_i=0\), the factor \(1-P(x_i=1|y)\) still contributes to the product.
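The per-feature formula can be sketched as follows. The probabilities in `p_given_y` are hypothetical values of \(P(x_i=1|y)\) for three binary features, chosen only to make the arithmetic easy to follow.

```python
def bernoulli_likelihood(x, p):
    """Product over features of P(x_i | y) for a binary record.

    x: list of 0/1 feature values
    p: list of P(x_i = 1 | y), one entry per feature
    """
    result = 1.0
    for x_i, p_i in zip(x, p):
        # A present feature contributes p_i; an absent one contributes 1 - p_i.
        result *= p_i * x_i + (1.0 - p_i) * (1 - x_i)
    return result

p_given_y = [0.8, 0.1, 0.5]   # hypothetical P(x_i = 1 | y)

likelihood_a = bernoulli_likelihood([1, 0, 1], p_given_y)  # 0.8 * 0.9 * 0.5
likelihood_b = bernoulli_likelihood([1, 1, 1], p_given_y)  # 0.8 * 0.1 * 0.5

print(likelihood_a, likelihood_b)
```

The second feature is rare in this class, so its absence in the first record raises the likelihood (factor 0.9) while its presence in the second lowers it (factor 0.1).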

Some advantages of Bernoulli naive bayes are that it is relatively simple (for small datasets) to implement and that it performs well on tasks such as text classification (e.g. detecting spam emails). However, it only models binary (0/1) features, so any other data must first be reduced to presence/absence indicators. This is applicable for this research, but may not fit all use cases.

7.2 Data and Code

To prepare the data, several steps were necessary. Namely, the data for this research effort is of mixed (categorical and quantitative) types. Different data transformation techniques and algorithms were required for application to source data to place it in a usable format for each Naive Bayes algorithm.

For all model code, in addition to building 2 models with and without protected class information, an exploration into statistically significant differences in model performance was performed. The experiment is constructed under the following parameters:

For each naive bayes model type, repeat the following steps, 500 times:
- initialize a random seed
- sample the records from both datasets (with and without protected class information) using the same random seed so as to pull the same records, on an 80/20 train/test split. This ensures the same subjects are present in the training and testing data for each random sampling.
- train two models using the training data
  - in the event of an error in fitting or predicting under the current train/test split, decrement the loop counter so that we ensure 500 measurements and restart the loop.
- predict the outcomes using the testing data
- capture model performance metrics (accuracy, precision, recall, F1, ROC-AUC) for the two models trained with and without protected class information (age/gender/race).
After metrics for 500 models are captured:
- Construct a paired t-test for difference in means between the two models’ performance metrics, on a per-metric basis.
  - \(H_0\): There is no difference in the mean performance metric for models when trained with and without protected class information
  - \(H_A\): There is a difference in the mean performance metric for models when trained with and without protected class information
  - \(\alpha=0.003\) or a \(3\sigma\) confidence
- Visualize the distribution of performance metrics for each model
- Conclude on any statistically significant differences between the models’ performance
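Two pieces of the procedure above can be sketched with the standard library alone: drawing the same train/test split for both datasets from a shared seed, and testing the paired differences. This is an illustration under stated assumptions, not the code in the appendices. The function names are hypothetical, the metric lists are made-up placeholders, and the p-value uses a normal approximation to the t distribution (reasonable for 500 trials, less so for the 5 placeholder values here).

```python
import math
import random
import statistics

def shared_split(n_records, seed, test_fraction=0.2):
    """Return (train_idx, test_idx) drawn reproducibly from range(n_records)."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_test = int(round(n_records * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def paired_test(metric_a, metric_b):
    """Paired test statistic and two-sided normal-approximation p-value."""
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    stat = statistics.mean(diffs) / std_err
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value

# The same seed yields the same row positions for both datasets.
train_a, test_a = shared_split(1000, seed=7)
train_b, test_b = shared_split(1000, seed=7)

# Placeholder accuracies for five paired trials (not results from this study).
with_protected = [0.90, 0.91, 0.89, 0.92, 0.90]
without_protected = [0.91, 0.93, 0.90, 0.94, 0.91]
stat, p_value = paired_test(with_protected, without_protected)

print(len(train_a), len(test_a), stat, p_value)
```

A negative statistic here means the second model (without protected classes) scored higher on average, which matches the sign convention visible in the result tables later in this chapter.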

The output for these tests are located as follows:

7.2.1 Multinomial Naive Bayes

The data preparation and code were executed natively in Appendix G, sourcing from the final clean dataset.

Multinomial Naive Bayes requires count data. Since the native format of the initially cleaned dataset is purely record data, converting it to count information is a challenge. Namely, each feature and feature value either occurs, or it doesn’t. As such, the data had to be transformed into a one-hot encoded dataset with 1 if the feature value occurred in the record, and 0 otherwise. This same data is leveraged in Bernoulli Naive Bayes for this portion of the research.
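The one-hot transformation can be sketched in plain Python. The helper below is hypothetical and uses three `derived_sex` values visible in the source data. It produces `column_value` names in the same style as the encoded tables that follow; the actual preparation code lives in the appendix and may differ.

```python
def one_hot(records, columns):
    """Expand each categorical column into 0/1 indicator columns."""
    # Collect the distinct values of each column, in sorted order.
    categories = {c: sorted({r[c] for r in records}) for c in columns}
    encoded = []
    for r in records:
        row = {}
        for c in columns:
            for value in categories[c]:
                row[f"{c}_{value}"] = 1 if r[c] == value else 0
        encoded.append(row)
    return encoded

records = [
    {"derived_sex": "Male"},
    {"derived_sex": "Joint"},
    {"derived_sex": "Sex Not Available"},
]
encoded = one_hot(records, ["derived_sex"])

for row in encoded:
    print(row)
```

Each record ends up with exactly one 1 per original column, which is why the encoded tables are much wider than the source table.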

Table 7.11: Initial Data Used

	state_code	county_code	derived_sex	action_taken	preapproval	open-end_line_of_credit	loan_amount	loan_to_value_ratio	interest_rate	...	tract_median_age_of_housing_units	applicant_race	co-applicant_race	applicant_ethnicity	co-applicant_ethnicity	aus	denial_reason	outcome	company	income_from_median
0	OH	39153.0	Sex Not Available	1	2	2	665000.0	85.000	4.250	...	36	32768	131072	32	128	64	512	1.0	JP Morgan	True
1	NY	36061.0	Male	1	2	2	755000.0	21.429	4.250	...	0	32768	262144	64	256	64	512	1.0	JP Morgan	False
2	NY	36061.0	Sex Not Available	1	1	2	965000.0	80.000	5.250	...	0	65536	262144	64	256	64	512	1.0	JP Morgan	False
3	FL	12011.0	Male	1	2	2	705000.0	92.175	5.125	...	12	32768	262144	32	256	64	512	1.0	JP Morgan	False
4	MD	24031.0	Joint	1	2	2	1005000.0	65.574	5.625	...	69	66	32768	32	32	64	512	1.0	JP Morgan	False
5	NC	37089.0	Joint	1	1	2	695000.0	85.000	6.000	...	39	32768	32768	32	32	64	512	1.0	JP Morgan	False
6	CA	6073.0	Joint	2	2	2	905000.0	75.000	6.250	...	44	2	2	32	32	64	512	1.0	JP Morgan	False
7	NY	36061.0	Sex Not Available	2	1	2	355000.0	15.909	5.625	...	63	65536	65536	64	64	64	512	1.0	JP Morgan	False
8	NY	36061.0	Joint	1	1	2	1085000.0	90.000	5.625	...	75	32768	32768	32	32	64	512	1.0	JP Morgan	True
9	MO	29189.0	Sex Not Available	2	1	2	405000.0	53.333	5.750	...	0	65536	65536	64	64	64	512	1.0	JP Morgan	True

10 rows × 51 columns

Table 7.12: MultinomialNB Training Data (With protected classes)

	derived_sex_Female	derived_sex_Joint	derived_sex_Male	derived_sex_Sex Not Available	purchaser_type_0	purchaser_type_1	purchaser_type_3	...	co-applicant_ethnicity_No Co-applicant	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	outcome
149746	1	0	0	0	0	1	0	...	1	1	0	0	0	1.0
105015	0	0	1	0	1	0	0	...	1	1	0	0	0	1.0
29094	0	1	0	0	0	1	0	...	0	1	1	1	0	1.0
101082	0	0	1	0	1	0	0	...	1	1	0	0	0	1.0
77750	1	0	0	0	1	0	0	...	1	0	0	0	1	1.0
197336	0	1	0	0	1	0	0	...	0	0	1	0	0	0.0
137650	0	0	0	1	0	1	0	...	0	1	0	0	0	1.0
136772	1	0	0	0	0	1	0	...	1	1	0	0	0	1.0
41734	1	0	0	0	1	0	0	...	1	0	1	0	1	0.0
10710	1	0	0	0	0	0	1	...	1	0	1	1	0	1.0

10 rows × 244 columns

Table 7.13: MultinomialNB Testing Data (With protected classes)

	derived_sex_Joint	derived_sex_Male	derived_sex_Sex Not Available	purchaser_type_0	purchaser_type_3	...	co-applicant_ethnicity_No Co-applicant	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	aus_Not applicable	outcome
30860	0	1	0	0	1	...	1	1	1	1	0	0	1.0
126890	0	0	1	0	1	...	1	1	0	0	0	0	1.0
28730	0	1	0	0	1	...	1	1	1	1	0	0	1.0
31244	0	1	0	0	1	...	1	1	1	1	0	0	1.0
56105	0	1	0	0	1	...	0	0	1	0	0	0	1.0
5443	0	1	0	0	1	...	1	0	1	0	0	0	1.0
95702	1	0	0	1	0	...	0	0	0	0	0	1	1.0
83812	1	0	0	1	0	...	0	0	0	0	1	0	0.0
84338	1	0	0	0	1	...	0	0	1	0	0	0	1.0
124545	1	0	0	0	1	...	0	1	0	0	0	0	1.0

10 rows × 244 columns

Table 7.14: MultinomialNB Training Data (No protected classes)

	purchaser_type_0	purchaser_type_1	purchaser_type_3	preapproval_2	...	tract_median_age_of_housing_units_M	tract_median_age_of_housing_units_MH	tract_median_age_of_housing_units_ML	company_Bank of America	company_JP Morgan	company_Navy Federal Credit Union	company_Rocket Mortgage	company_Wells Fargo	outcome
149746	0.0	1.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0
105015	1.0	0.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
29094	0.0	1.0	0.0	1.0	...	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
101082	1.0	0.0	0.0	1.0	...	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0
77750	1.0	0.0	0.0	1.0	...	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	1.0
197336	1.0	0.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
137650	0.0	1.0	0.0	1.0	...	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0
136772	0.0	1.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0
41734	1.0	0.0	0.0	1.0	...	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
10710	0.0	0.0	1.0	1.0	...	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0

10 rows × 176 columns

Table 7.15: MultinomialNB Testing Data (No protected classes)

	purchaser_type_0	purchaser_type_3	preapproval_2	...	tract_median_age_of_housing_units_M	tract_median_age_of_housing_units_MH	tract_median_age_of_housing_units_ML	company_Bank of America	company_JP Morgan	company_Navy Federal Credit Union	company_Rocket Mortgage	company_Wells Fargo	outcome
30860	0.0	1.0	1.0	...	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
126890	0.0	1.0	1.0	...	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0
28730	0.0	1.0	1.0	...	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0
31244	0.0	1.0	1.0	...	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
56105	0.0	1.0	1.0	...	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	1.0
5443	0.0	1.0	1.0	...	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0
95702	1.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
83812	1.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
84338	0.0	1.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	1.0
124545	0.0	1.0	1.0	...	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0

10 rows × 176 columns

Also of note: the record indexes in Table 7.12 and Table 7.14 match one another, meaning that the two models are trained on the same subjects. Similarly, the indexes between Table 7.13 and Table 7.15 match, so they are tested on the same subjects between the two models.

Also note that the indexes between Table 7.12 and Table 7.13 are disjoint, which means that the model has disjoint training and testing data. Similarly, Table 7.14 and Table 7.15 are disjoint.

By achieving these splits, the two models evaluated with and without protected class information will avoid unnecessary biases in the results. When a model is tested on data with which it has already been trained, the model has already optimized to the best of its ability to correctly classify that data. Evaluating a model on the same data used in training would therefore artificially inflate its performance metrics (accuracy, precision, recall, F1, ROC-AUC), so it is paramount to have disjoint training and testing datasets.
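Both index properties can be checked mechanically. The sketch below uses only the ten record indexes displayed in each of Tables 7.12 through 7.15, so it illustrates the check rather than validating the full splits.

```python
# Record indexes as displayed in Tables 7.12 through 7.15 (first 10 rows each).
train_with = [149746, 105015, 29094, 101082, 77750,
              197336, 137650, 136772, 41734, 10710]
test_with = [30860, 126890, 28730, 31244, 56105,
             5443, 95702, 83812, 84338, 124545]
train_without = [149746, 105015, 29094, 101082, 77750,
                 197336, 137650, 136772, 41734, 10710]
test_without = [30860, 126890, 28730, 31244, 56105,
                5443, 95702, 83812, 84338, 124545]

# Same subjects across the two datasets.
same_training_subjects = train_with == train_without
same_testing_subjects = test_with == test_without

# No subject appears in both the training and testing data.
no_leakage = (set(train_with).isdisjoint(test_with)
              and set(train_without).isdisjoint(test_without))

print(same_training_subjects, same_testing_subjects, no_leakage)
```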

7.2.2 Bernoulli Naive Bayes

The data preparation was executed in Appendix G, sourced from the final clean dataset. To prepare the data for two different models, 2 copies were made. The second copy dropped all columns that included protected class information, and the first retained all source columns.

To perform an MCA, the data must be in a one-hot-encoded (or binary 0/1) format per category and feature. Since this work was previously done, the steps were seamless to include for this purpose.

Examples of the two dataset samples used for Bernoulli Naive Bayes are summarized in Table 7.12, Table 7.13, Table 7.14, and Table 7.15.

7.2.3 Categorical Naive Bayes

The data for categorical naive bayes was sourced from the final clean dataset.

In Appendix D, the data is transformed into label-encoded format to perform the categorical naive bayes analysis. In some cases, binary data was captured (mostly in terms of protected class information). These binary captured data were parsed back out into featurename:featurevalue columns with a 1 if the record met the condition, and zero otherwise. This format still meets the needs of categorical naive bayes.
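Label encoding replaces each category with an integer code. The sketch below assumes codes are assigned in sorted order of the category names, which is consistent with the `derived_sex` values shown for the same record indexes in the one-hot and label-encoded training tables. The helper name is hypothetical and the appendix code may use a library encoder instead.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# derived_sex for the first seven training records, read off the
# one-hot encoded training table.
derived_sex = ["Female", "Male", "Joint", "Male",
               "Female", "Joint", "Sex Not Available"]
codes, mapping = label_encode(derived_sex)

print(codes)
print(mapping)
```

The integer codes carry no ordering information for the model; categorical naive bayes treats each code as a distinct category when counting.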

Table 7.16: CategoricalNB Training Data (With protected classes)

	derived_sex	preapproval	open-end_line_of_credit	loan_amount	loan_to_value_ratio	interest_rate	total_loan_costs	origination_charges	discount_points	lender_credits	...	co-applicant_ethnicity_No Co-applicant	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	outcome
149746	0	1	1	1	2	2	1	1	1	1	...	1	1	0	0	0	1.0
105015	2	1	1	1	2	2	1	1	1	1	...	1	1	0	0	0	1.0
29094	1	1	1	1	2	2	1	1	1	1	...	0	1	1	1	0	1.0
101082	2	1	1	1	2	2	0	0	0	1	...	1	1	0	0	0	1.0
77750	0	1	1	2	1	2	1	1	1	1	...	1	0	0	0	1	1.0
197336	1	1	1	1	2	2	1	1	1	1	...	0	0	1	0	0	0.0
137650	3	1	1	2	2	0	3	1	1	1	...	0	1	0	0	0	1.0
136772	0	1	1	1	4	2	1	2	1	1	...	1	1	0	0	0	1.0
41734	0	1	1	1	1	2	1	1	1	1	...	1	0	1	0	1	0.0
10710	0	1	1	1	2	2	1	1	1	1	...	1	0	1	1	0	1.0

10 rows × 101 columns

Table 7.17: CategoricalNB Testing Data (With protected classes)

	derived_sex	preapproval	open-end_line_of_credit	loan_amount	loan_to_value_ratio	interest_rate	total_loan_costs	origination_charges	discount_points	lender_credits	...	co-applicant_ethnicity_No Co-applicant	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	aus_Not applicable	outcome
30860	2	1	1	2	2	2	3	1	1	1	...	1	1	1	1	0	0	1.0
126890	3	1	1	3	2	3	1	1	1	1	...	1	1	0	0	0	0	1.0
28730	2	1	1	1	2	3	3	1	1	1	...	1	1	1	1	0	0	1.0
31244	2	1	1	2	2	2	0	0	0	1	...	1	1	1	1	0	0	1.0
56105	2	1	1	3	1	2	1	1	1	1	...	0	0	1	0	0	0	1.0
5443	2	1	1	3	2	2	1	1	1	1	...	1	0	1	0	0	0	1.0
95702	1	1	1	1	3	0	1	1	1	1	...	0	0	0	0	0	1	1.0
83812	1	1	1	2	2	2	1	1	1	1	...	0	0	0	0	1	0	0.0
84338	1	1	1	1	1	2	3	1	3	1	...	0	0	1	0	0	0	1.0
124545	1	1	1	1	2	2	1	1	1	1	...	0	1	0	0	0	0	1.0

10 rows × 101 columns

Table 7.18: CategoricalNB Training Data (No protected classes)

	preapproval	open-end_line_of_credit	loan_amount	loan_to_value_ratio	interest_rate	total_loan_costs	origination_charges	discount_points	lender_credits	loan_term	...	company	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	outcome
149746	1	1	1	2	2	1	1	1	1	2	...	3	1	0	0	0	1.0
105015	1	1	1	2	2	1	1	1	1	2	...	2	1	0	0	0	1.0
29094	1	1	1	2	2	1	1	1	1	2	...	1	1	1	1	0	1.0
101082	1	1	1	2	2	0	0	0	1	2	...	2	1	0	0	0	1.0
77750	1	1	2	1	2	1	1	1	1	2	...	4	0	0	0	1	1.0
197336	1	1	1	2	2	1	1	1	1	2	...	3	0	1	0	0	0.0
137650	1	1	2	2	0	3	1	1	1	2	...	3	1	0	0	0	1.0
136772	1	1	1	4	2	1	2	1	1	2	...	3	1	0	0	0	1.0
41734	1	1	1	1	2	1	1	1	1	2	...	0	0	1	0	1	0.0
10710	1	1	1	2	2	1	1	1	1	2	...	1	0	1	1	0	1.0

10 rows × 34 columns

Table 7.19: CategoricalNB Testing Data (No protected classes)

	preapproval	open-end_line_of_credit	loan_amount	loan_to_value_ratio	interest_rate	total_loan_costs	origination_charges	discount_points	lender_credits	loan_term	...	company	aus_Desktop Underwriter	aus_Loan Prospector/Product Advisor	aus_Other	aus_Internal Proprietary	aus_Not applicable	outcome
30860	1	1	2	2	2	3	1	1	1	2	...	1	1	1	1	0	0	1.0
126890	1	1	3	2	3	1	1	1	1	2	...	3	1	0	0	0	0	1.0
28730	1	1	1	2	3	3	1	1	1	2	...	1	1	1	1	0	0	1.0
31244	1	1	2	2	2	0	0	0	1	2	...	1	1	1	1	0	0	1.0
56105	1	1	3	1	2	1	1	1	1	1	...	0	0	1	0	0	0	1.0
5443	1	1	3	2	2	1	1	1	1	2	...	1	0	1	0	0	0	1.0
95702	1	1	1	3	0	1	1	1	1	2	...	2	0	0	0	0	1	1.0
83812	1	1	2	2	2	1	1	1	1	2	...	4	0	0	0	1	0	0.0
84338	1	1	1	1	2	3	1	3	1	2	...	4	0	1	0	0	0	1.0
124545	1	1	1	2	2	1	1	1	1	2	...	3	1	0	0	0	0	1.0

10 rows × 34 columns

Just as for MultinomialNB, the data for CategoricalNB has the same combination of matching indexes (between datasets) and disjoint indexes (between train and test splits).

7.3 Results

7.3.1 Categorical Naive Bayes

Figure 7.1: Confusion Matrices (CategoricalNB, Single-Run)

Table 7.20: Model Performance Metrics (CategoricalNB, Single Run)

Model	Data	Accuracy	Precision	Recall	F1	ROC-AUC
CategoricalNB	With Protected Classes	0.882012	0.943416	0.917050	0.930046	0.795993
CategoricalNB	Without Protected Classes	0.912455	0.934819	0.964922	0.949632	0.783651

Figure 7.2: Metric Kernel Density Estimates (500 randomizations, CategoricalNB)

Table 7.21: Statistical Significance Tests (Model Performance Metrics, CategoricalNB)

Model	Stat	z-score	p-value	top performer	top mean	difference in means
CategoricalNB	Accuracy	-8.271750	0.000000	Without Protected Classes	0.883466	0.031069
CategoricalNB	Precision	2.768951	0.005624	With Protected Classes	0.924105	0.006065
CategoricalNB	Recall	-21.365216	0.000000	Without Protected Classes	0.948421	0.046966
CategoricalNB	F1	-9.276604	0.000000	Without Protected Classes	0.932983	0.020344
CategoricalNB	ROC-AUC	1.049461	0.293966	With Protected Classes	0.731962	0.007958

7.3.2 Multinomial Naive Bayes

Figure 7.3: Confusion Matrices (MultinomialNB, Single-Run)

Table 7.22: Model Performance Metrics (MultinomialNB, Single Run)

Model	Data	Accuracy	Precision	Recall	F1	ROC-AUC
MultinomialNB	With Protected Classes	0.936358	0.967880	0.957361	0.962591	0.884798
MultinomialNB	Without Protected Classes	0.939653	0.969309	0.959833	0.964548	0.890112

Figure 7.4: Metric Kernel Density Estimates (500 randomizations, MultinomialNB)

Table 7.23: Statistical Significance Tests (Model Performance Metrics, MultinomialNB)

Model	Stat	z-score	p-value	top performer	top mean	difference in means
MultinomialNB	Accuracy	-1.242399	0.214090	Without Protected Classes	0.804602	0.000175
MultinomialNB	Precision	12.735062	0.000000	With Protected Classes	0.914013	0.000778
MultinomialNB	Recall	-7.701771	0.000000	Without Protected Classes	0.852538	0.001103
MultinomialNB	F1	-2.489032	0.012809	Without Protected Classes	0.881842	0.000228
MultinomialNB	ROC-AUC	10.578913	0.000000	With Protected Classes	0.689026	0.002104

7.3.3 Bernoulli Naive Bayes

Figure 7.5: Confusion Matrices (BernoulliNB, Single-Run)

Table 7.24: Model Performance Metrics (BernoulliNB, Single Run)

Model	Data	Accuracy	Precision	Recall	F1	ROC-AUC
BernoulliNB	With Protected Classes	0.937514	0.973560	0.952818	0.963077	0.899943
BernoulliNB	Without Protected Classes	0.939875	0.977551	0.951553	0.964377	0.911205

Figure 7.6: Metric Kernel Density Estimates (500 randomizations, BernoulliNB)

Table 7.25: Statistical Significance Tests (Model Performance Metrics, BernoulliNB)

Model	Stat	z-score	p-value	top performer	top mean	difference in means
BernoulliNB	Accuracy	-53.447073	0.000000	Without Protected Classes	0.940754	0.003477
BernoulliNB	Precision	-44.319309	0.000000	Without Protected Classes	0.969658	0.002109
BernoulliNB	Recall	-31.047924	0.000000	Without Protected Classes	0.960794	0.001971
BernoulliNB	F1	-52.811800	0.000000	Without Protected Classes	0.965205	0.002040
BernoulliNB	ROC-AUC	-49.602472	0.000000	Without Protected Classes	0.891554	0.007172

7.3.4 Overall

Table 7.26: Summary of single-run model performance outcomes

Model	Data	Accuracy	Precision	Recall	F1	ROC-AUC
CategoricalNB	With Protected Classes	0.882012	0.943416	0.917050	0.930046	0.795993
CategoricalNB	Without Protected Classes	0.912455	0.934819	0.964922	0.949632	0.783651
MultinomialNB	With Protected Classes	0.936358	0.967880	0.957361	0.962591	0.884798
MultinomialNB	Without Protected Classes	0.939653	0.969309	0.959833	0.964548	0.890112
BernoulliNB	With Protected Classes	0.937514	0.973560	0.952818	0.963077	0.899943
BernoulliNB	Without Protected Classes	0.939875	0.977551	0.951553	0.964377	0.911205

Table 7.27: Summary of 500 Randomization Test Outcomes (All Model Types)

Model	Stat	z-score	p-value	top performer	top mean	difference in means
BernoulliNB	Precision	-44.319309	0.000000	Without Protected Classes	0.969658	0.002109
BernoulliNB	F1	-52.811800	0.000000	Without Protected Classes	0.965205	0.002040
BernoulliNB	Recall	-31.047924	0.000000	Without Protected Classes	0.960794	0.001971
CategoricalNB	Recall	-21.365216	0.000000	Without Protected Classes	0.948421	0.046966
BernoulliNB	Accuracy	-53.447073	0.000000	Without Protected Classes	0.940754	0.003477
CategoricalNB	F1	-9.276604	0.000000	Without Protected Classes	0.932983	0.020344
CategoricalNB	Precision	2.768951	0.005624	With Protected Classes	0.924105	0.006065
MultinomialNB	Precision	12.735062	0.000000	With Protected Classes	0.914013	0.000778
BernoulliNB	ROC-AUC	-49.602472	0.000000	Without Protected Classes	0.891554	0.007172
CategoricalNB	Accuracy	-8.271750	0.000000	Without Protected Classes	0.883466	0.031069
MultinomialNB	F1	-2.489032	0.012809	Without Protected Classes	0.881842	0.000228
MultinomialNB	Recall	-7.701771	0.000000	Without Protected Classes	0.852538	0.001103
MultinomialNB	Accuracy	-1.242399	0.214090	Without Protected Classes	0.804602	0.000175
CategoricalNB	ROC-AUC	1.049461	0.293966	With Protected Classes	0.731962	0.007958
MultinomialNB	ROC-AUC	10.578913	0.000000	With Protected Classes	0.689026	0.002104
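The p-values in Table 7.27 appear consistent with a two-sided normal tail applied to the listed z-scores. That is an inference from the numbers, not something stated in the text. Under that assumption, the sketch below recomputes the p-values and lists the tests that do not reach significance at \(\alpha=0.003\).

```python
import math

ALPHA = 0.003

# z-scores copied from Table 7.27.
z_scores = {
    ("BernoulliNB", "Accuracy"): -53.447073,
    ("BernoulliNB", "Precision"): -44.319309,
    ("BernoulliNB", "Recall"): -31.047924,
    ("BernoulliNB", "F1"): -52.811800,
    ("BernoulliNB", "ROC-AUC"): -49.602472,
    ("CategoricalNB", "Accuracy"): -8.271750,
    ("CategoricalNB", "Precision"): 2.768951,
    ("CategoricalNB", "Recall"): -21.365216,
    ("CategoricalNB", "F1"): -9.276604,
    ("CategoricalNB", "ROC-AUC"): 1.049461,
    ("MultinomialNB", "Accuracy"): -1.242399,
    ("MultinomialNB", "Precision"): 12.735062,
    ("MultinomialNB", "Recall"): -7.701771,
    ("MultinomialNB", "F1"): -2.489032,
    ("MultinomialNB", "ROC-AUC"): 10.578913,
}

def two_sided_p(z):
    """Two-sided tail probability of a standard normal at |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

not_significant = sorted(
    key for key, z in z_scores.items() if two_sided_p(z) >= ALPHA
)

for model, stat in not_significant:
    print(model, stat, round(two_sided_p(z_scores[(model, stat)]), 6))
```

Four of the fifteen tests fall short of the threshold: two for CategoricalNB and two for MultinomialNB. Every BernoulliNB test is significant.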

Examining the above figures and tables for each model type, performance metrics, and the distribution of performance metrics across 500 random trials, several findings are evident:

Bernoulli Naive Bayes is the best performing model type, with mean metric scores ranging from 89.2% to 97.0%. In Table 7.27, results are sorted by the top mean metric in descending order. For every metric, BernoulliNB outperforms the Categorical and Multinomial naive bayes models.
There are statistically significant differences in model performance for most performance metrics, across all naive bayes model types (exceptions at \(\alpha=0.003\): CategoricalNB Precision, CategoricalNB ROC-AUC, MultinomialNB Accuracy, MultinomialNB F1)
The difference in mean performance metrics, where significant, is less than 5% across all hypothesis tests.

These findings are revealing. First and foremost, if a naive bayes model were to be employed, it should be the Bernoulli format, regardless of what data is included. This can be of use in future modeling efforts. It effectively examines whether or not a record meets a specific set of conditions that support a conclusion of approval or denial of a loan.

Furthermore, in the case of Bernoulli naive bayes, there is a statistically significant outcome in favor of the exclusion of protected class information to deliver higher model performance (with mean metrics roughly 0.2 to 0.7 percentage points higher).

Were a financial institution to instead employ CategoricalNB or MultinomialNB, they would be doing themselves a disservice in terms of model performance. However, they do have access to additional variables and information not available to the general public, and may find that either of these models serves their predictive needs better than BernoulliNB.

That being the case, for each of these models there are cases where the models perform better when protected class information is included in the training and testing data. Of the four cases listed below, only the two MultinomialNB results are statistically significant at \(\alpha=0.003\):

	Model	Stat	z-score	p-value	top performer	top mean	difference in means
1	CategoricalNB	Precision	2.768951	5.623710e-03	With Protected Classes	0.924105	0.006065
4	CategoricalNB	ROC-AUC	1.049461	2.939661e-01	With Protected Classes	0.731962	0.007958
1	MultinomialNB	Precision	12.735062	3.775460e-37	With Protected Classes	0.914013	0.000778
4	MultinomialNB	ROC-AUC	10.578913	3.732618e-26	With Protected Classes	0.689026	0.002104

When the better performing model includes protected class information, one can clearly see that, even where the result is statistically significant, the difference in means between the models is less than 1%. With such a slight difference, the performance gain from incorporating these features is not justified, and from an operational standpoint, the features can be excluded from models with minimal impact to effective predictions.