4  Multiple Correspondence Analysis

4.1 Overview

Many of the research questions within this effort are inherently linked to categorical factors rather than numeric factors. Principal component analysis can only be performed on numeric data, not on categorical data - even in the case of ordinal data. As a simple example, consider an ordinal variable “size” with categories small, medium, large, and extra-large. One could apply a simple encoding and assign small=1, medium=2, large=3, and extra-large=4. This encoding, while apparently holding a degree of validity in terms of increasing size, does not match reality mathematically. Consider ordering a fountain drink at a fast-food restaurant and ask the question - is a large the same as three smalls? Is an extra-large the same as one medium and two smalls? Rarely is the answer to either question “yes”. One might have to add decimal places to the categories, and at that point, one may as well get the exact size measurements in fluid ounces or liters, which may or may not be possible.

The additive and multiplicative relationships implied by assigning values to categories pose challenges for ordinal variables. These challenges are further compounded when moving beyond ordinal variables to nominal ones. One runs the risk of making mathematical claims such as “red is 4 times blue” or “sad is 3 less than happy”. Such statements are nonsensical and have no foundation in mathematics; while they may produce results in a model post-transformation, those results lack validity, explainability, and generalizability.

Enter Multiple Correspondence Analysis (MCA). MCA performs an action on categorical variables analogous to what PCA performs on numeric variables. To perform an MCA, one must first construct a Complete Disjunctive Table (CDT), which is effectively a one-hot encoded matrix. Each source categorical column is transformed into one column per category, and the value of a new column is set to 1 if the current row is a member of that category and 0 otherwise. This is repeated for all columns and categories until the dataset is fully expanded.

a b
s f
m w
s s
s f
m f
m f
l w
l s
l f
m w

Taking the above example table, one can transform it to a one-hot encoded table:

a_l a_m a_s b_f b_s b_w
0 0 1 1 0 0
0 1 0 0 0 1
0 0 1 0 1 0
0 0 1 1 0 0
0 1 0 1 0 0
0 1 0 1 0 0
1 0 0 0 0 1
1 0 0 0 1 0
1 0 0 1 0 0
0 1 0 0 0 1

Notice that there are now 6 columns from the original 2 columns. This is because column ‘a’ had 3 categories (s/m/l), as did column ‘b’ (s/w/f). A column is created for each pairing of a source column and one of its categories, hence 6 columns in this case.
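As a minimal illustration, the expansion above can be reproduced with pandas (assuming the example table is loaded as a DataFrame with columns `a` and `b`):

```python
import pandas as pd

# The example table from above: two categorical columns.
df = pd.DataFrame({
    "a": ["s", "m", "s", "s", "m", "m", "l", "l", "l", "m"],
    "b": ["f", "w", "s", "f", "f", "f", "w", "s", "f", "w"],
})

# Build the Complete Disjunctive Table: one indicator column per category.
cdt = pd.get_dummies(df).astype(int)
print(cdt)  # columns: a_l, a_m, a_s, b_f, b_s, b_w
```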

After performing this transformation, the following mathematical operations are applied (a code sketch implementing these steps appears after the list):

  • Calculate the sum of all values (0s and 1s) in the CDT as the scalar \(N\)

  • Calculate matrix \(Z = \frac{CDT}{N}\)

Z:
0 1 2 3 4 5
0 0.00 0.00 0.05 0.05 0.00 0.00
1 0.00 0.05 0.00 0.00 0.00 0.05
2 0.00 0.00 0.05 0.00 0.05 0.00
3 0.00 0.00 0.05 0.05 0.00 0.00
4 0.00 0.05 0.00 0.05 0.00 0.00
5 0.00 0.05 0.00 0.05 0.00 0.00
6 0.05 0.00 0.00 0.00 0.00 0.05
7 0.05 0.00 0.00 0.00 0.05 0.00
8 0.05 0.00 0.00 0.05 0.00 0.00
9 0.00 0.05 0.00 0.00 0.00 0.05
  • Calculate the column-wise sums as vector \(c\). Transform it into the diagonal matrix \(D_c\)
c:
0 1 2 3 4 5
0 0.15 0.2 0.15 0.25 0.1 0.15
Dc:
0 1 2 3 4 5
0 0.15 0.0 0.00 0.00 0.0 0.00
1 0.00 0.2 0.00 0.00 0.0 0.00
2 0.00 0.0 0.15 0.00 0.0 0.00
3 0.00 0.0 0.00 0.25 0.0 0.00
4 0.00 0.0 0.00 0.00 0.1 0.00
5 0.00 0.0 0.00 0.00 0.0 0.15
  • Calculate the row-wise sums as vector \(r\). Transform it into the diagonal matrix \(D_r\)
r:
0
0 0.1
1 0.1
2 0.1
3 0.1
4 0.1
5 0.1
6 0.1
7 0.1
8 0.1
9 0.1
Dr:
0 1 2 3 4 5 6 7 8 9
0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1
  • Calculate matrix \(M = D_r^{-\frac{1}{2}}(Z-rc^T)D_c^{-\frac{1}{2}}\)
M:
0 1 2 3 4 5
0 -0.122474 -0.141421 0.285774 0.158114 -0.1 -0.122474
1 -0.122474 0.212132 -0.122474 -0.158114 -0.1 0.285774
2 -0.122474 -0.141421 0.285774 -0.158114 0.4 -0.122474
3 -0.122474 -0.141421 0.285774 0.158114 -0.1 -0.122474
4 -0.122474 0.212132 -0.122474 0.158114 -0.1 -0.122474
5 -0.122474 0.212132 -0.122474 0.158114 -0.1 -0.122474
6 0.285774 -0.141421 -0.122474 -0.158114 -0.1 0.285774
7 0.285774 -0.141421 -0.122474 -0.158114 0.4 -0.122474
8 0.285774 -0.141421 -0.122474 0.158114 -0.1 -0.122474
9 -0.122474 0.212132 -0.122474 -0.158114 -0.1 0.285774
  • Perform a matrix decomposition (specifically, a singular value decomposition) on \(M\):

  • seek two matrices with orthonormal columns (i.e., each column has unit length), \(P\) and \(Q\), and the generalized diagonal matrix of singular values \(\Delta\), such that \(M=P\Delta Q^T\)

P:
0 1 2 3 4 5
0 0.266849 -0.369737 -0.029655 -0.335052 0.407906 0.052569
1 -0.448558 0.052436 0.315562 -0.066040 -0.618981 0.175176
2 0.502221 0.093666 0.563688 0.073941 -0.177477 -0.434748
3 0.266849 -0.369737 -0.029655 -0.335052 -0.448932 -0.204336
4 -0.168262 -0.293458 -0.127323 0.421015 -0.002788 -0.406828
5 -0.168262 -0.293458 -0.127323 0.421015 -0.002788 -0.406828
6 -0.199414 0.452079 -0.196143 -0.498964 0.067258 -0.618150
7 0.316254 0.569588 -0.045685 0.397085 0.031808 0.043738
8 0.080882 0.106185 -0.639028 -0.011908 -0.317570 -0.012104
9 -0.448558 0.052436 0.315562 -0.066040 0.333219 -0.143542
Q:
0 1 2 3 4 5
0 0.093132 -0.503227 0.487944 0.101450 0.472167 -0.516494
1 0.584231 -0.216247 -0.334531 -0.489600 0.420783 0.288503
2 -0.584231 0.216247 0.334531 -0.489600 0.420783 0.288503
3 -0.093132 0.503227 -0.487944 0.101450 0.472167 -0.516494
4 0.545100 0.629428 0.545100 -0.069109 -0.043709 -0.053532
5 0.053532 0.061813 0.053532 0.703721 0.445073 0.545100
  • The squared singular values, \(\Delta^2\), provide the eigenvalues of the decomposition:
0
0 7.512071e-01
1 6.211313e-01
2 3.788687e-01
3 2.487929e-01
4 1.954416e-32
5 3.228308e-33
  • Use the eigenvalues to project the input data into the new eigenbasis.
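The steps above can be condensed into a short numpy sketch. This is illustrative only - the `cdt` array is assumed to be the one-hot matrix built earlier, and the actual code for this effort is in Appendix G:

```python
import numpy as np

def mca_decomposition(cdt: np.ndarray):
    """Illustrative MCA core: returns P, the singular values, Q, and the eigenvalues."""
    N = cdt.sum()                             # grand total of all 0s and 1s
    Z = cdt / N                               # correspondence matrix
    c = Z.sum(axis=0)                         # column masses (vector c)
    r = Z.sum(axis=1)                         # row masses (vector r)
    D_c_inv_sqrt = np.diag(1.0 / np.sqrt(c))  # D_c^{-1/2}
    D_r_inv_sqrt = np.diag(1.0 / np.sqrt(r))  # D_r^{-1/2}
    # Standardized residuals: M = D_r^{-1/2} (Z - r c^T) D_c^{-1/2}
    M = D_r_inv_sqrt @ (Z - np.outer(r, c)) @ D_c_inv_sqrt
    P, delta, Qt = np.linalg.svd(M, full_matrices=False)
    return P, delta, Qt.T, delta ** 2         # eigenvalues are the squared singular values

# Using the example CDT built earlier:
# P, delta, Q, eigenvalues = mca_decomposition(cdt.to_numpy(dtype=float))
```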

To maximize the use of variables within the dataset and to support answering the various research questions, it is necessary to perform MCA on transformations of the data.

4.2 Data & Code

The data used is the source data for this effort, converted into a one-hot encoded, sparse matrix format. The code and applied transformations can be seen in Appendix G.

The transformed data has two forms: one includes all protected class variables (age, gender, and race), and the second excludes these variables. These two forms allow exploration of the different research questions for this effort.

Numeric variables were converted to categories by binning them according to their distance, in standard deviations, from the mean; the resulting category suffixes are discussed in Section 4.3.2.
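A minimal sketch of one plausible conversion follows. The specific cut points and labels (L, ML, M, MH, H) here are assumptions patterned after the suffixes discussed in Section 4.3.2, not necessarily the exact scheme used:

```python
import numpy as np
import pandas as pd

def sd_bin(series: pd.Series) -> pd.Series:
    """Bin a numeric column by its distance from the mean in standard deviations."""
    mu, sigma = series.mean(), series.std()
    edges = [-np.inf, mu - 2 * sigma, mu - sigma, mu + sigma, mu + 2 * sigma, np.inf]
    # Assumed labels: L (< -2 SD), ML (-2 to -1 SD), M (within 1 SD),
    # MH (+1 to +2 SD), H (> +2 SD).
    return pd.cut(series, bins=edges, labels=["L", "ML", "M", "MH", "H"])
```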

4.3 Results

Table 4.1: MCA Summary of Eigenvalues (with protected class information) - Top 5
eigenvalue % of variance % of variance (cumulative)
0 0.215 4.69% 4.69%
1 0.167 3.63% 8.32%
2 0.166 3.62% 11.93%
3 0.115 2.49% 14.43%
4 0.102 2.21% 16.64%
Table 4.2: MCA Summary of Eigenvalues (with protected class information) - Bottom 5
eigenvalue % of variance % of variance (cumulative)
176 0.001 0.02% 99.96%
177 0.001 0.01% 99.97%
178 0.000 0.01% 99.98%
179 0.000 0.01% 99.99%
180 0.000 0.01% 100.00%
Table 4.3: MCA Summary of Eigenvalues (without protected class information) - Top 5
eigenvalue % of variance % of variance (cumulative)
0 0.124 3.21% 3.21%
1 0.116 3.01% 6.23%
2 0.100 2.61% 8.83%
3 0.086 2.23% 11.06%
4 0.085 2.22% 13.28%
Table 4.4: MCA Summary of Eigenvalues (without protected class information) - Bottom 5
eigenvalue % of variance % of variance (cumulative)
95 0.009 0.24% 99.52%
96 0.007 0.17% 99.69%
97 0.007 0.17% 99.86%
98 0.004 0.11% 99.97%
99 0.001 0.03% 100.00%

On its own, MCA does not necessarily provide dimensionality reduction; expanding categorical columns into indicator columns can in fact increase the data’s dimensionality. What MCA enables is the use of more variables, together with a principled reduction of the resulting sparse matrix. The sparse matrices used to produce the transformations had 243 columns (with protected class information) and 179 columns (without).

The resulting transformations allow substantial dimensionality reduction of these sparse matrices. With 181 components, the data containing protected class information achieved 99.99% explained variance (a reduction of 62 features). Similarly, with 100 components, the data excluding protected class information achieved 99.99% explained variance (a reduction of 79 features).
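The eigenvalue summaries in Tables 4.1 through 4.4 are simple bookkeeping over \(\Delta^2\); a sketch of that bookkeeping, assuming the `eigenvalues` array from the earlier decomposition sketch:

```python
import numpy as np

def components_for_variance(eigenvalues: np.ndarray, target_pct: float = 99.99) -> int:
    """Return the smallest number of components whose cumulative variance reaches target_pct."""
    pct = eigenvalues / eigenvalues.sum() * 100  # % of variance per component
    cum = np.cumsum(pct)                         # cumulative % of variance
    return int(np.searchsorted(cum, target_pct)) + 1
```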

MCA also provides insight into which columns contribute most strongly to each Multiple Correspondence Component (MC) in the output data. Exploring some of these provides interesting insights. Below, we look at the first three components and examine which columns from the source data provide the strongest contributions to the transformation, sorted in descending order; the results appear in the subsections that follow.
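Under the SVD formulation above, one common definition of a column’s contribution to component \(k\) - mass times squared principal coordinate over the eigenvalue, \(c_j g_{jk}^2 / \lambda_k\) - reduces to the squared entries of \(Q\), so each component’s contributions sum to 1. A sketch, assuming `Q` and the CDT column names from the earlier sketches:

```python
import numpy as np
import pandas as pd

def column_contributions(Q: np.ndarray, columns) -> pd.DataFrame:
    """Contribution of each source column to each component (each MC column sums to 1)."""
    mc_names = [f"MC{k + 1}" for k in range(Q.shape[1])]
    return pd.DataFrame(Q ** 2, index=columns, columns=mc_names)

# Top 10 contributors to the first component, in descending order:
# column_contributions(Q, cdt.columns)["MC1"].sort_values(ascending=False).head(10)
```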

4.3.1 With Protected Classes

Table 4.5: MC1 (with protected class information) - Top 10 Column Contributors
Column MC1
121 co-applicant_sex_observed_2 0.061418
96 co-applicant_ethnicity_observed_2 0.061381
103 co-applicant_race_observed_2 0.061369
1 derived_sex_Joint 0.053307
90 co-applicant_credit_score_type_9 0.047954
229 co-applicant_ethnicity_Not Hispanic/Latino 0.046580
98 co-applicant_ethnicity_observed_4 0.045165
140 co-applicant_age_8.0 0.045165
105 co-applicant_race_observed_4 0.045165
123 co-applicant_sex_observed_4 0.045165
Table 4.6: MC2 (with protected class information) - Top 10 Column Contributors
Column MC2
196 applicant_race_Not Applicable 0.141186
131 applicant_age_7.0 0.141186
109 applicant_sex_4 0.141186
101 applicant_race_observed_3 0.141186
94 applicant_ethnicity_observed_3 0.141186
119 applicant_sex_observed_3 0.141186
223 applicant_ethnicity_Not Applicable 0.141186
231 co-applicant_ethnicity_Not Applicable 0.001395
214 co-applicant_race_Not Applicable 0.001395
139 co-applicant_age_7.0 0.001395
Table 4.7: MC3 (with protected class information) - Top 10 Column Contributors
Column MC3
231 co-applicant_ethnicity_Not Applicable 0.141338
214 co-applicant_race_Not Applicable 0.141338
139 co-applicant_age_7.0 0.141338
122 co-applicant_sex_observed_3 0.141338
114 co-applicant_sex_4 0.141338
104 co-applicant_race_observed_3 0.141338
97 co-applicant_ethnicity_observed_3 0.141338
119 applicant_sex_observed_3 0.001412
223 applicant_ethnicity_Not Applicable 0.001412
94 applicant_ethnicity_observed_3 0.001412

Examining the above three tables, it is evident that features revolving around protected class information contribute substantially to each MC. For each of the MCs, much of the contribution comes from feature values spanning all of the protected classes: age, sex, race, and ethnicity. Models trained on data of this nature could easily produce predictions biased (either for or against) test data that falls within these categories.

4.3.2 Without Protected Classes

Table 4.8: MC1 (without protected class information) - Top 10 Column Contributors
Column MC1
29 origination_charges_H 0.088933
25 total_loan_costs_H 0.084998
32 discount_points_H 0.076381
14 loan_amount_ML 0.054607
27 total_loan_costs_MH 0.049097
31 origination_charges_MH 0.047695
50 property_value_ML 0.043231
34 discount_points_MH 0.033078
0 purchaser_type_0 0.032864
9 open-end_line_of_credit_1 0.030098
Table 4.9: MC2 (without protected class information) - Top 10 Column Contributors
Column MC2
47 property_value_H 0.078352
11 loan_amount_H 0.075937
81 applicant_credit_score_type_9 0.062141
56 income_MH 0.051055
0 purchaser_type_0 0.051019
44 intro_rate_period_H 0.042191
121 company_Bank of America 0.040706
9 open-end_line_of_credit_1 0.034975
124 company_Rocket Mortgage 0.034505
49 property_value_MH 0.031335
Table 4.10: MC3 (without protected class information) - Top 10 Column Contributors
Column MC3
107 tract_owner_occupied_units_H 0.089676
112 tract_one_to_four_family_homes_H 0.080570
111 tract_owner_occupied_units_ML 0.077914
88 tract_population_H 0.073417
115 tract_one_to_four_family_homes_MH 0.056009
110 tract_owner_occupied_units_MH 0.051876
116 tract_one_to_four_family_homes_ML 0.049923
106 tract_to_msa_income_percentage_ML 0.044282
92 tract_population_ML 0.037601
91 tract_population_MH 0.037590

Examining the MCs for data that excludes protected class information, we immediately see that each holds data likely to be highly relevant to the decision of whether or not to approve a loan. The _MH and _H suffixes among the MC1 column contributors signify values between one and two standard deviations above the mean, and more than two standard deviations above the mean, respectively. High loan costs, high origination charges, high discount points, and so forth are all likely to influence the decision of whether or not to grant a loan.

It is also interesting to see that in MC2, two companies stand out - Bank of America and Rocket Mortgage - apparently having a higher influence on the loan outcome along the second MC.

4.3.3 Clustering