9  Regression Modeling

9.1 Overview

9.1.1 Linear Regression

9.1.1.1 What is Linear Regression?

Linear regression is a special case of generalized linear modeling that seeks to minimize the mean squared error between each datapoint in the dataset and the predicted output value on a continuous scale. It assumes that the underlying data can be modeled as a weighted linear combination of the input features:

\[ \hat{y}=w_1x_1+w_2x_2+...+w_kx_k + b \tag{9.1}\]

The optimization objective is the mean squared error, the average squared distance between each datapoint in the dataset and the prediction line:

\[ \text{MSE} = \frac{1}{n}\sum\limits_{i=1}^{n}(y_i-\hat{y}_i)^2 \tag{9.2}\]

Identifying the weights allows a user to make a numeric prediction on the interval \((-\infty,\infty)\) for an output or dependent variable \(Y\), based upon a vector of inputs \(X\).
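
As a hedged illustration only (the project's actual code is in Appendix F), a minimal sketch of fitting the weights of Equation 9.1 by minimizing Equation 9.2 with scikit-learn could look like the following; the data here is synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data: 100 observations of k = 3 features with a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=100)

# Fit w_1..w_k and the intercept b from Equation 9.1 by minimizing the MSE in Equation 9.2.
model = LinearRegression().fit(X, y)
print("weights:", model.coef_, "intercept:", model.intercept_)
print("MSE:", mean_squared_error(y, model.predict(X)))
```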

9.1.2 Logistic Regression

9.1.2.1 What is Logistic Regression?

Logistic regression is a method to predict the probability of a boolean outcome on the interval \([0,1]\). To do so, logistic regression combines linear regression with the sigmoid function \(\hat{y}=\frac{1}{1+e^{-z}}\), where \(z = w_1x_1+w_2x_2+...+w_kx_k + b\), and gradient ascent (or descent, depending on the implementation) to link the linear equation from all real numbers to the interval \([0,1]\).
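
A minimal sketch of this link, using hypothetical weights and inputs rather than values fitted in this project:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Map the linear combination z = w·x + b from (-inf, inf) onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single input vector.
w = np.array([0.8, -1.2, 0.4])
b = 0.1
x = np.array([1.5, 0.3, -0.7])

z = w @ x + b            # the linear regression component
y_hat = sigmoid(z)       # the predicted probability on [0, 1]
print(round(float(y_hat), 4))
```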

9.1.2.2 How do Logistic and Linear Regression Compare?

The key similarity between linear and logistic regression is the use of a linear combination of weighted inputs as in Equation 9.1 above, and each leverages an optimization problem to maximize the performance of its predictive outcomes. However, this is about where their similarities end. The differences in output space (\((-\infty,\infty)\) vs. \([0,1]\)), in the type of optimization problem used (mean squared error minimization vs. maximum likelihood estimation), and in their applications (continuous numeric prediction vs. probability prediction) truly set these two regression methods apart.

9.1.2.3 How Does Logistic Regression Predict Probability?

The sigmoid function is paramount in the implementation of logistic regression, as it remaps the unbounded output space of a linear regression onto the interval \([0,1]\). Since linear regression has an infinite output space, it is less useful in the prediction of an outcome or a class. The sigmoid function, and derivations from it, are used both to train logistic regression models and to make predictions with them once they are trained.

9.1.2.4 How is Logistic Regression Trained?

Maximum likelihood estimation is used as part of the optimization effort for a logistic regression. Each weight \(w_i\) in the equation for \(z\) above needs to be adjusted so as to maximize the likelihood that the input dataset is correctly classified. To calculate how the likelihood changes, the algorithm leverages partial derivatives to determine how each weight should be adjusted while pursuing a local maximum (gradient ascent) or minimum (gradient descent) using

  • the log likelihood function \[ \log p(y\mid x) = \sum\limits_{i=1}^n y_i\cdot \log(\hat{y}_i)+(1-y_i)\cdot\log(1-\hat{y}_i) \tag{9.3}\]

  • partial derivatives of the likelihood with respect to each weight \[ \frac{\partial L}{\partial w_i} = (\hat{y}_i-y_i)x_i = \frac{\partial L}{\partial \hat{y}_i}\frac{\partial\hat{y}_i}{\partial z}\frac{\partial z}{\partial w_i} \tag{9.4}\]

to iteratively adjust the weights in the direction that improves the likelihood, as sketched below.
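
As a hedged sketch (not the project's implementation, which is in Appendix F), a plain gradient-ascent loop built directly from Equations 9.3 and 9.4 might look as follows; the learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Gradient ascent on the log likelihood in Equation 9.3."""
    n, k = X.shape
    w = np.zeros(k)
    b = 0.0
    for _ in range(n_iter):
        y_hat = sigmoid(X @ w + b)
        # Equation 9.4 per record, summed over the dataset; for ascent the sign is (y - y_hat).
        grad_w = X.T @ (y - y_hat)
        grad_b = np.sum(y - y_hat)
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b
```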

9.1.2.5 How Is Maximum Likelihood Used in Logistic Regression?

Using the above, each iteration of a logistic regression seeks to increase the likelihood, or probability, that a given input vector \(X\) will be classified as the appropriate category \(Y\), adjusting the weights to bring the output probability closer to 0 when \(X\) is not a member of \(Y\) and closer to 1 when \(X\) is a member of \(Y\). Because one cannot directly know or infer a closed-form solution for the weights that maximize \(\log p(y\mid x)\), the optimization must proceed iteratively, and gradient-based methods in general carry the pitfall of settling on a local rather than global maximum or minimum.

9.2 Data

The data used for logistic regression is the output of the multiple correspondence analysis (MCA) section. The code leverages two outputs: the MCA that includes the protected classes, and the MCA that does not.

The process to generate these data is contained and described in Appendix G.

9.2.1 Logistic Regression Data

As done in Chapter 7 for train/test splits, the indexes are disjoint between the training and testing datasets for each parent dataset (with and without protected classes), and each model is trained and tested on the same records so that the models can be compared and evaluated directly against one another.

By achieving these splits, the two models, evaluated with and without protected class information, will avoid unnecessary biases in the results. When a model is tested on data with which it has already been trained, the model has already optimized itself to classify those training records correctly, so evaluating it on the same data artificially inflates its performance metrics (accuracy, precision, recall, F1, ROC-AUC). It is therefore paramount to keep the training and testing datasets disjoint.
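
A minimal sketch of producing such a shared, disjoint split is below; the function and variable names are illustrative, not those used in Appendix G. One set of shuffled indices is applied to both the with- and without-protected-class frames so each model trains and tests on the same records.

```python
import numpy as np
import pandas as pd

def shared_split(df_with: pd.DataFrame, df_without: pd.DataFrame,
                 test_frac: float = 0.2, seed: int = 42):
    """Split both frames on one shared, disjoint set of row indices."""
    assert df_with.index.equals(df_without.index)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(df_with.index.to_numpy())
    n_test = int(len(shuffled) * test_frac)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return (df_with.loc[train_idx], df_with.loc[test_idx],
            df_without.loc[train_idx], df_without.loc[test_idx])
```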

Here are the first few rows of the training and testing data (with and without protected classes):

Table 9.1: MCA Training Data (With Protected Classes)
MC1 MC2 MC3 MC4 MC5 MC6 MC7 MC8 MC9 MC10 ... MC173 MC174 MC175 MC176 MC177 MC178 MC179 MC180 MC181 outcome
149746 -0.387299 -0.001936 -0.003167 -0.003736 -0.024160 -0.058848 -0.161641 -0.163059 0.051339 -0.009136 ... 0.003101 -0.001341 0.000861 0.004199 -0.000527 -0.002532 -0.008719 -0.000873 -0.002317 1.0
105015 -0.386534 0.007586 0.005638 -0.149832 0.037838 0.251784 0.098772 -0.104072 0.013961 -0.418258 ... 0.000377 0.001873 0.001106 -0.027093 0.005149 0.003389 0.007496 -0.003416 0.003757 1.0
29094 0.584290 -0.011996 -0.005619 0.162373 -0.135637 -0.251829 -0.104188 -0.168750 -0.177008 0.118405 ... 0.022515 -0.007842 -0.002415 -0.049875 0.001095 0.072832 -0.005292 0.078073 0.014257 1.0
101082 -0.334664 0.002222 0.000164 -0.101558 0.021665 -0.211819 0.279624 0.695377 0.687991 -0.076981 ... 0.026285 0.001354 -0.002768 0.006803 0.007875 0.004828 0.006313 -0.007748 0.009507 1.0
77750 -0.281561 0.009739 0.007302 0.294964 -0.119251 0.159063 0.758784 -0.039995 -0.195625 0.002697 ... -0.052733 -0.000610 -0.008253 0.117513 0.000083 0.007967 -0.010744 -0.000727 0.000273 1.0
197336 0.486499 -0.010249 -0.000090 -0.015523 -0.006435 0.167655 -0.198908 0.150687 -0.162005 0.046811 ... -0.012404 -0.000997 -0.002124 0.001909 0.000212 -0.007067 0.015649 0.006640 -0.000096 0.0
137650 0.467874 0.001350 -0.009676 -1.188243 0.363408 0.128250 0.223274 -0.317145 0.196431 -0.287222 ... -0.019506 -0.000767 -0.000612 -0.005116 0.001205 -0.005075 0.000100 -0.000287 -0.002095 1.0
136772 -0.367590 -0.001345 -0.000256 -0.156002 0.020358 -0.132456 0.018576 -0.093998 0.103849 0.001622 ... -0.106859 0.004671 0.000376 0.009465 -0.000201 0.007182 -0.011129 -0.008761 -0.001996 1.0
41734 -0.330008 0.011018 0.006242 0.272821 -0.054578 0.325838 0.372732 0.030138 -0.028335 0.063026 ... 0.011053 0.001163 0.004312 -0.062842 0.000968 0.007450 -0.009850 -0.003101 0.003397 0.0
10710 -0.426415 -0.004343 -0.003192 0.108570 -0.054307 -0.125460 -0.280849 0.185723 -0.081365 0.132101 ... 0.020289 -0.005505 0.000400 0.092377 0.001436 0.006528 -0.011197 0.000699 -0.019417 1.0

10 rows × 182 columns

Table 9.2: MCA Testing Data (With Protected Classes)
MC1 MC2 MC3 MC4 MC5 MC6 MC7 MC8 MC9 MC10 ... MC173 MC174 MC175 MC176 MC177 MC178 MC179 MC180 MC181 outcome
30860 -0.377629 -0.003778 -0.000284 0.123173 -0.082610 -0.252011 -0.083817 0.137802 -0.059589 0.146982 ... 0.033447 -0.004992 0.012373 -0.043422 -0.000974 0.003553 0.006232 0.001033 -0.018414 1.0
126890 -0.467556 0.004930 -0.005359 -0.641571 0.206492 0.242741 -0.226606 -0.014591 -0.472724 0.021588 ... 0.010695 -0.000696 0.000022 0.016918 0.003402 -0.003975 -0.003803 0.001092 -0.000510 1.0
28730 -0.413508 -0.005233 0.001155 0.121581 -0.085867 -0.230894 -0.106145 0.189636 -0.061267 0.069930 ... 0.015688 0.000374 -0.005150 -0.031315 -0.003929 0.005044 0.008720 0.012316 -0.079917 1.0
31244 -0.287660 -0.003570 0.003981 0.050584 -0.079588 -0.564932 0.493067 0.565763 0.294100 0.394073 ... 0.071348 0.005973 0.002630 -0.031388 0.001923 -0.059667 0.008618 0.053812 0.008861 1.0
56105 0.475915 0.012787 -0.010241 0.304675 -0.048947 0.467680 -0.300472 -0.138536 0.078400 0.050529 ... 0.021655 -0.007473 0.009129 -0.029546 0.005734 -0.003265 -0.093248 0.005869 -0.020469 1.0
5443 -0.532447 -0.000680 -0.000100 0.149901 -0.052502 0.037784 -0.165450 0.385730 -0.450069 0.097568 ... -0.175609 0.002486 0.005582 -0.060549 -0.006316 -0.007500 0.009540 0.004287 0.003089 1.0
95702 0.488661 0.002546 0.001658 0.293300 -0.075460 0.567510 -0.399183 0.208255 0.236614 -0.575200 ... 0.051268 -0.003152 -0.000812 0.022663 -0.002750 -0.011304 0.013903 0.001937 0.017053 1.0
83812 0.618010 -0.008268 0.003282 0.208434 -0.083836 0.068062 0.195019 0.212166 0.059184 -0.127586 ... -0.056974 -0.001156 -0.027070 0.117950 0.000061 0.008795 -0.008416 0.003242 -0.017692 0.0
84338 0.655969 -0.004918 -0.006596 0.294970 -0.113992 0.117014 -0.038303 -0.000417 0.119094 -0.173938 ... -0.025098 -0.003851 -0.015221 0.000430 0.005239 0.005173 -0.009246 0.003599 -0.017117 1.0
124545 0.570068 -0.019301 -0.010219 0.132395 -0.125488 -0.262735 0.254719 -0.193122 -0.455048 0.200752 ... 0.018821 0.002103 0.000520 0.011330 -0.000257 0.012774 0.019057 0.005942 0.000166 1.0

10 rows × 182 columns

Table 9.3: MCA Training Data (Without Protected Classes)
MC1 MC2 MC3 MC4 MC5 MC6 MC7 MC8 MC9 MC10 ... MC92 MC93 MC94 MC95 MC96 MC97 MC98 MC99 MC100 outcome
149746 -0.022191 -0.265491 -0.089926 0.030741 -0.203565 0.045974 0.036976 -0.158538 -0.089747 -0.118819 ... 0.018080 -0.010963 -0.029876 -0.003884 -0.001979 0.007441 -0.007867 0.005483 -0.000806 1.0
105015 0.290036 0.178485 -0.157600 -0.271749 -0.245192 0.415834 0.123361 0.055042 -0.280792 -0.105106 ... 0.032420 0.008380 -0.046675 -0.011824 0.044305 0.096120 0.266568 0.009842 0.006172 1.0
29094 -0.095327 -0.086033 -0.071726 -0.077756 -0.175656 -0.205904 -0.027044 -0.228361 -0.037807 -0.112685 ... 0.011820 -0.023923 0.035317 0.019936 -0.090864 0.017211 0.003241 -0.006380 -0.000049 1.0
101082 -0.654622 0.371221 0.433967 0.351343 0.378372 1.075594 -0.389695 -0.187974 -0.091264 -0.120999 ... -0.048364 0.117014 -0.080124 0.017038 0.005769 0.113322 0.239627 0.041617 0.008369 1.0
77750 -0.011772 0.645262 0.057628 -0.176639 -0.285778 -0.331408 -0.273163 0.149750 -0.234999 0.106382 ... 0.236344 -0.074790 -0.100306 -0.338227 -0.034114 0.028328 -0.038353 -0.014300 0.001131 1.0
197336 0.211278 -0.030244 0.227029 -0.003941 0.369867 -0.046471 0.319009 -0.191355 -0.301365 0.207057 ... 0.031491 -0.015596 -0.012795 -0.014921 -0.018947 0.004192 -0.022371 0.006839 -0.000022 0.0
137650 -0.358538 0.051337 -0.836959 -0.183025 0.827005 -0.153045 -0.300371 0.067741 -0.113371 -0.321046 ... 0.335969 -0.153317 0.006543 0.153396 0.313536 -0.124046 0.047060 -0.003220 0.003096 1.0
136772 -0.224883 -0.192682 -0.084809 0.117577 -0.313160 -0.083558 0.224909 0.148168 -0.065917 -0.105496 ... 0.182218 0.384711 0.246450 -0.027070 -0.039627 -0.072453 0.029784 -0.224192 -0.000603 1.0
41734 0.126939 0.621233 0.144349 0.053672 -0.309407 -0.342110 -0.120858 0.063892 -0.205557 -0.164335 ... 0.072717 -0.012892 -0.102817 -0.445682 -0.024483 0.166019 -0.124861 -0.005165 0.002151 0.0
10710 0.163013 -0.275794 0.117153 -0.011200 -0.019967 0.048655 0.044069 -0.165311 0.018394 -0.103244 ... -0.030015 0.019834 -0.002256 -0.022811 0.042606 -0.007783 -0.012869 -0.010702 -0.000356 1.0

10 rows × 101 columns

Table 9.4: MCA Testing Data (Without Protected Classes)
MC1 MC2 MC3 MC4 MC5 MC6 MC7 MC8 MC9 MC10 ... MC92 MC93 MC94 MC95 MC96 MC97 MC98 MC99 MC100 outcome
30860 0.021908 -0.107937 0.160667 -0.096196 0.018201 -0.101238 0.033902 -0.304192 0.099209 -0.167334 ... -0.041614 0.049508 0.024640 0.119880 0.038063 -0.152665 0.053603 -0.017657 -0.001590 1.0
126890 0.480633 -0.445114 0.321141 -0.039117 0.011562 -0.032186 -0.610614 0.551665 0.076726 0.035808 ... -0.031216 0.024834 -0.065209 -0.075517 0.059589 -0.056023 0.004988 0.007007 0.003624 1.0
28730 -0.008824 -0.143329 0.122037 -0.081711 0.180048 -0.110025 0.093520 -0.358804 0.058229 -0.195015 ... -0.070911 0.049980 -0.013948 -0.019076 0.031768 -0.006892 -0.030458 -0.016992 -0.001599 1.0
31244 -0.929934 0.290919 0.916922 0.393976 0.653189 0.575340 -0.367097 -0.433499 -0.010454 0.096068 ... 0.139571 0.046301 -0.099484 -0.084230 -0.004915 0.071326 -0.032964 0.019019 0.000421 1.0
56105 0.414472 0.449156 -0.207480 -0.003224 -0.150305 0.032234 -0.191760 0.153439 1.021892 -0.041131 ... -0.055936 0.036294 0.038325 0.232323 0.003879 0.034577 0.006583 0.002948 0.006281 1.0
5443 0.523671 -0.283900 0.655485 -0.123366 0.718045 -0.163190 -0.157075 0.188473 0.113406 0.062999 ... -0.058962 0.001311 0.019947 -0.076058 -0.048344 -0.037792 -0.004104 -0.015193 -0.001536 1.0
95702 0.473073 0.236969 -0.363466 -0.336683 -0.106912 1.000458 0.259367 0.206744 -0.271422 0.230963 ... -0.101931 0.052594 -0.061589 0.005126 -0.186917 -0.079407 -0.302814 -0.002668 -0.006435 1.0
83812 -0.140662 0.424349 0.083509 -0.163135 -0.410733 -0.206931 -0.126389 -0.014700 -0.278051 -0.149923 ... 0.003635 0.030724 -0.095295 -0.276234 0.016694 0.262435 -0.092800 -0.011479 0.001361 0.0
84338 0.060723 0.581497 -0.263134 -0.246079 -0.336327 -0.090347 -0.142133 -0.116463 0.962056 0.254970 ... -0.010415 0.008387 -0.065012 -0.232432 0.236782 0.015017 0.039840 -0.005290 0.007312 1.0
124545 -0.019499 -0.310502 -0.078770 0.035220 -0.240221 0.032627 -0.037473 -0.112962 -0.098873 -0.104411 ... 0.036539 -0.009720 -0.027945 -0.016062 0.043209 0.004422 0.017510 0.006984 0.000563 1.0

10 rows × 101 columns

Notice that each has a different number of components. Each retains the number of components required to explain approximately 99.99% of the variance in its dataset.
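
As a small illustrative sketch (the actual MCA pipeline is described in Appendix G), the component count can be chosen from the cumulative explained-variance ratios, assuming an array of per-component ratios is available from the MCA step:

```python
import numpy as np

def n_components_for(explained_ratio: np.ndarray, threshold: float = 0.9999) -> int:
    """Smallest number of leading components whose cumulative variance share meets the threshold."""
    cumulative = np.cumsum(explained_ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)

# e.g. n_components_for(mca_explained_ratio) would return the component counts
# reflected above (181 with protected classes, 100 without), assuming mca_explained_ratio
# holds the per-component ratios from the MCA fit.
```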

As in Chapter 7 and Chapter 8, a 500-iteration randomization test of the source data was performed against both models to compare the difference in means.

9.2.2 Multinomial Naive Bayes

The description for MultinomialNB data and code can be found in Chapter 7.

Table 9.5: Initial Data Used
state_code county_code derived_sex action_taken purchaser_type preapproval open-end_line_of_credit loan_amount loan_to_value_ratio interest_rate ... tract_median_age_of_housing_units applicant_race co-applicant_race applicant_ethnicity co-applicant_ethnicity aus denial_reason outcome company income_from_median
0 OH 39153.0 Sex Not Available 1 0 2 2 665000.0 85.000 4.250 ... 36 32768 131072 32 128 64 512 1.0 JP Morgan True
1 NY 36061.0 Male 1 0 2 2 755000.0 21.429 4.250 ... 0 32768 262144 64 256 64 512 1.0 JP Morgan False
2 NY 36061.0 Sex Not Available 1 0 1 2 965000.0 80.000 5.250 ... 0 65536 262144 64 256 64 512 1.0 JP Morgan False
3 FL 12011.0 Male 1 0 2 2 705000.0 92.175 5.125 ... 12 32768 262144 32 256 64 512 1.0 JP Morgan False
4 MD 24031.0 Joint 1 0 2 2 1005000.0 65.574 5.625 ... 69 66 32768 32 32 64 512 1.0 JP Morgan False
5 NC 37089.0 Joint 1 0 1 2 695000.0 85.000 6.000 ... 39 32768 32768 32 32 64 512 1.0 JP Morgan False
6 CA 6073.0 Joint 2 0 2 2 905000.0 75.000 6.250 ... 44 2 2 32 32 64 512 1.0 JP Morgan False
7 NY 36061.0 Sex Not Available 2 0 1 2 355000.0 15.909 5.625 ... 63 65536 65536 64 64 64 512 1.0 JP Morgan False
8 NY 36061.0 Joint 1 0 1 2 1085000.0 90.000 5.625 ... 75 32768 32768 32 32 64 512 1.0 JP Morgan True
9 MO 29189.0 Sex Not Available 2 0 1 2 405000.0 53.333 5.750 ... 0 65536 65536 64 64 64 512 1.0 JP Morgan True

10 rows × 51 columns

Table 9.6: MultinomialNB Training Data (With protected classes)
derived_sex_Female derived_sex_Joint derived_sex_Male derived_sex_Sex Not Available purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
149746 1 0 0 0 0 1 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
105015 0 0 1 0 1 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
29094 0 1 0 0 0 1 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1.0
101082 0 0 1 0 1 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
77750 1 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 1 0 0 1.0
197336 0 1 0 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0.0
137650 0 0 0 1 0 1 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 1.0
136772 1 0 0 0 0 1 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
41734 1 0 0 0 1 0 0 0 0 0 ... 1 0 1 0 0 0 1 0 0 0.0
10710 1 0 0 0 0 0 1 0 0 0 ... 1 0 1 0 0 1 0 0 0 1.0

10 rows × 244 columns

Table 9.7: MultinomialNB Testing Data (With protected classes)
derived_sex_Female derived_sex_Joint derived_sex_Male derived_sex_Sex Not Available purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 ... co-applicant_ethnicity_No Co-applicant aus_Desktop Underwriter aus_Loan Prospector/Product Advisor aus_TOTAL Scorecard aus_GUS aus_Other aus_Internal Proprietary aus_Not applicable aus_Exempt outcome
30860 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
126890 0 0 0 1 0 0 1 0 0 0 ... 1 1 0 0 0 0 0 0 0 1.0
28730 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
31244 0 0 1 0 0 0 1 0 0 0 ... 1 1 1 0 0 1 0 0 0 1.0
56105 0 0 1 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 1.0
5443 0 0 1 0 0 0 1 0 0 0 ... 1 0 1 0 0 0 0 0 0 1.0
95702 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1.0
83812 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0.0
84338 0 1 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 1.0
124545 0 1 0 0 0 0 1 0 0 0 ... 0 1 0 0 0 0 0 0 0 1.0

10 rows × 244 columns

Table 9.8: MultinomialNB Training Data (No protected classes)
purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 purchaser_type_71 preapproval_1 preapproval_2 open-end_line_of_credit_1 ... tract_median_age_of_housing_units_H tract_median_age_of_housing_units_M tract_median_age_of_housing_units_MH tract_median_age_of_housing_units_ML company_Bank of America company_JP Morgan company_Navy Federal Credit Union company_Rocket Mortgage company_Wells Fargo outcome
149746 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
105015 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
29094 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
101082 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
77750 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
197336 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
137650 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0
136772 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
41734 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
10710 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0

10 rows × 176 columns

Table 9.9: MultinomialNB Testing Data (No protected classes)
purchaser_type_0 purchaser_type_1 purchaser_type_3 purchaser_type_5 purchaser_type_6 purchaser_type_9 purchaser_type_71 preapproval_1 preapproval_2 open-end_line_of_credit_1 ... tract_median_age_of_housing_units_H tract_median_age_of_housing_units_M tract_median_age_of_housing_units_MH tract_median_age_of_housing_units_ML company_Bank of America company_JP Morgan company_Navy Federal Credit Union company_Rocket Mortgage company_Wells Fargo outcome
30860 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
126890 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
28730 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
31244 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
56105 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
5443 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0
95702 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
83812 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
84338 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
124545 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0

10 rows × 176 columns

9.3 Code

The code for logistic regression can be found in Appendix F, and the code for multinomial naive Bayes in Appendix D.

9.4 Results

9.4.1 Logistic Regression

Figure 9.1: Logistic Regression Modeling Results
Table 9.10: Logistic Regression Model Metrics
Model Data Accuracy Precision Recall F1 ROC-AUC
Logistic Regression With Protected Classes 0.961859 0.983022 0.972197 0.977579 0.936481
Logistic Regression Without Protected Classes 0.958859 0.981304 0.970385 0.975814 0.930562

There is a chance that the difference in performance of the models trained with and without protected class data could be due to random chance, namely the random splitting of training and testing data from the source datasets.

To examine whether this (albeit minute) difference between the two modeling methods is meaningful, one can conduct a randomization test on the source data itself to check for a statistically significant difference in means.

To do so, one can shuffle the data multiple times into new training and testing datasets, re-train the models, and capture the performance scores of each model (one trained with, and one trained without, protected classes). From here, one obtains a distribution of the performance scores (Accuracy, Precision, Recall, F1, and ROC-AUC) for each model; as the number of randomizations increases, the distribution of each should approach a normal distribution.
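
A hedged sketch of that procedure, shown for a single metric (accuracy) with a simple paired z-test on the resulting score distributions; the real code covering all five metrics lives in Appendix F, and the function and parameter names here are illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def randomization_test(X_with, X_without, y, n_iter=500, seed=0):
    """Re-split, re-train, and score both models n_iter times, then compare means."""
    scores_with, scores_without = [], []
    for i in range(n_iter):
        idx_train, idx_test = train_test_split(np.arange(len(y)), test_size=0.2,
                                               random_state=seed + i)
        for X, scores in ((X_with, scores_with), (X_without, scores_without)):
            model = LogisticRegression(max_iter=1000)
            model.fit(X[idx_train], y[idx_train])
            scores.append(accuracy_score(y[idx_test], model.predict(X[idx_test])))
    a, b = np.array(scores_with), np.array(scores_without)
    diff = a - b
    z = diff.mean() / (diff.std(ddof=1) / np.sqrt(n_iter))   # paired z-statistic
    p = 2 * (1 - stats.norm.cdf(abs(z)))                     # two-sided p-value
    return z, p, a.mean(), b.mean()
```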

To perform the randomization, the data was shuffled 500 times, and two models were trained on each shuffle, capturing the aforementioned scores. When these shuffles were executed, the following outcomes were achieved:

Figure 9.2: Metric Kernel Density Estimates (500 randomizations, Logistic Regression)
Table 9.11: Per Metric Paired Z-test (Logistic Regression, 500 Iterations)
Stat z-score p-value top performer top mean difference in means
Accuracy 44.648798 0.000000 With Protected Classes 0.963586 0.002457
Precision 47.173034 0.000000 With Protected Classes 0.983867 0.001985
Recall 16.219786 0.000000 With Protected Classes 0.973386 0.000889
F1 43.977874 0.000000 With Protected Classes 0.978598 0.001431
ROC-AUC 50.590439 0.000000 With Protected Classes 0.939526 0.006305

One can see that, over 500 iterations of shuffling nearly 200k records, the model trained on the multiple correspondence analysis that included protected class information had statistically better performance across all metrics.

This is a substantial finding. Namely, this significant difference signifies that including protected class information in logistic regression measurably improves its predictive performance relative to excluding it from the model training data.

Furthermore, in comparison to all Naive Bayes models and all decision tree models, including their randomization testing, the Logistic Regression model outperforms them all in every metric. The most distinguishing factor is that the ROC-AUC score for Logistic Regression exceeds 93%, whereas few if any ROC-AUCs for other models exceeded 89%.

What does this mean? It means that, if using logistic regression modeling to assess whether or not a loan should be approved, a company could choose to include protected class data when building the model if it were concerned solely about model performance.

Ethically speaking, however, it should be excluded outright. This is further evidenced by the overall difference in performance between two models trained on the exact same data, except for the presence of protected class variables. The difference, while statistically significant, is not operationally significant, as the maximum difference in means across the performance metrics is roughly 0.6 percentage points.

Leveraging that ethical perspective: while models using protected class information mathematically outperform those that exclude it to a statistically significant degree, the cost of exclusion is minimal. Excluding protected class information still achieves incredibly high Accuracy, Precision, Recall, F1, and ROC-AUC, all ranging from roughly 93% to 98%. Such modeling can be leveraged to inform one as to the likely outcome of a loan application, and can be used in conjunction with other available relevant information to make an informed decision.

9.4.2 Multinomial Naive Bayes

The results here are the same as those included and described in Chapter 7, and the findings are unchanged.

Figure 9.3: Confusion Matrices (MultinomialNB, Single-Run)
Table 9.12: Model Performance Metrics (MultinomialNB, Single Run)
Model Data Accuracy Precision Recall F1 ROC-AUC
MultinomialNB With Protected Classes 0.936358 0.967880 0.957361 0.962591 0.884798
MultinomialNB Without Protected Classes 0.939653 0.969309 0.959833 0.964548 0.890112
Figure 9.4: Metric Kernel Density Estimates (500 randomizations, MultinomialNB)
Table 9.13: Statistical Significance Tests (Model Performance Metrics, MultinomialNB)
Model Stat z-score p-value top performer top mean difference in means
MultinomialNB Accuracy -1.242399 0.214090 Without Protected Classes 0.804602 0.000175
MultinomialNB Precision 12.735062 0.000000 With Protected Classes 0.914013 0.000778
MultinomialNB Recall -7.701771 0.000000 Without Protected Classes 0.852538 0.001103
MultinomialNB F1 -2.489032 0.012809 Without Protected Classes 0.881842 0.000228
MultinomialNB ROC-AUC 10.578913 0.000000 With Protected Classes 0.689026 0.002104

Examining MultinomialNB's performance, it seems to have fairly low performance in terms of accuracy. This is likely because it was provided with Bernoulli (0/1 indicator) data, as that was the only way to obtain count-like data from the source records. That being said, the model's precision is quite high.
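
For reference, a toy sketch of fitting MultinomialNB on 0/1 indicator data of the kind shown in Tables 9.6 through 9.9; the data below is fabricated purely for illustration, and the project's actual code is in Appendix D and Chapter 7:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy 0/1 indicator matrix standing in for the one-hot frames above.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1]])
y = np.array([1, 1, 0, 0])

nb = MultinomialNB().fit(X, y)
print(nb.predict_proba(X)[:, 1])   # P(outcome == 1) for each record
```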

Across 500 iterations of MultinomialNB, there was not a significant difference in the accuracy or F1 score of the models trained with and without protected class information. For the other metrics, the difference, while statistically significant, was within 1 percentage point of the top performer. Operationally speaking, there is not a substantial need for that level of performance; the inclusion or exclusion of protected class information makes a statistically significant, but not operationally impactful, difference in model performance.

9.5 Overall

The performance of Logistic Regression, with and without protected classes, far outshone the performance of Multinomial Naive Bayes. Multinomial Naive Bayes may have performed better had the data lent itself to true count vectorization; as such, it is better suited to document classification than to individual record classification.

Another difference between the models is what they do and how they do it. Logistic Regression is discriminative, meaning the potential output classes are known and the model is crafted to maximize the likelihood of predicting the correct probability for a class given new input data. MultinomialNB, however, is generative and seeks to identify the probability of the data, given a class.

Thus far, Logistic Regression's accuracy (along with its other metrics) makes it a top contender for modeling. This, however, comes at the cost of the substantial dimensionality required to explain approximately 99.99% of the variance in the data from the MCA execution. If dimensionality, computational memory, or time were a constraint, executing future predictions with logistic regression could become infeasible. This is especially the case when the model needs to be refit with additional new data: the sheer volume of features and row vectors produces a tremendous amount of data (nearly 1 GB), and executing gradient descent on that data to produce a well-performing logistic regression is also computationally expensive and time consuming.