David W. Hosmer and Stanley Lemeshow,
Applied Logistic Regression, Chapter 8.
Special Topics, 8.1 Polytomous Logistic Regression, pages 216 – 238.
(Polytomous logistic regression was the former name for multinomial logistic
regression).
Chapter 9, Multinomial Logistic
Regression Examples in SPSS Regression Models 10.0, pages 65-82.
Binary logistic regression
is an effective tool for analyzing relationships when the dependent variable
has two categories. When the dependent
variable has more than two categories, we can analyze the relationship with multinomial
logistic regression.
In binary logistic
regression, the analysis focuses on finding which independent variables
increase the likelihood that a subject is a member of a particular group specified
by the dependent variable, rather than a member of the other group. The "other group" functions as a reference
group, and acts as a baseline against which we compare membership in the
particular group we are interested in.
In effect, we use the
dependent variable in a logistic regression like a dummy-coded independent
variable in a multiple regression. We
dummy-code a two-value categorical variable in multiple regression by assigning
one category a code of 1 and the other category a code of 0. The regression coefficient for this
independent variable in a multiple regression equation specifies the change in
the dependent variable associated with a change from the 0 category to the 1
category. For example, suppose the
dependent variable was salary and the independent variable was sex, coded so
that 1 represented males and 0 represented females. A regression coefficient of $800 would mean that males in our
sample averaged $800 more in salary than their female counterparts. The group coded 0 represents the reference
or baseline group, and the regression coefficient represents the difference in
scores on the dependent variable for the group coded 1.
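The salary example can be checked with a small sketch. The salary figures below are invented for illustration (they are not from any dataset used here) and were chosen so that the men average $800 more than the women. The slope is computed with the ordinary least-squares formula for a single predictor.

```python
from statistics import mean

# Hypothetical data: sex is dummy-coded 0 = female (reference), 1 = male.
sex = [0, 0, 0, 1, 1, 1]
salary = [30000.0, 31000.0, 32000.0, 30800.0, 31800.0, 32800.0]

mean_x, mean_y = mean(sex), mean(salary)

# Least-squares slope and intercept for a single predictor.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sex, salary))
         / sum((x - mean_x) ** 2 for x in sex))
intercept = mean_y - slope * mean_x

# Group means, for comparison with the regression results.
female_mean = mean(y for x, y in zip(sex, salary) if x == 0)
male_mean = mean(y for x, y in zip(sex, salary) if x == 1)

print(intercept, slope, male_mean - female_mean)
```

With a single dummy-coded predictor, the intercept equals the mean of the group coded 0 and the coefficient equals the difference between the two group means.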
If we extend the example of
dummy coding to a three-group independent variable, the regression coefficients
represent the differences for two of the groups from the third group, which acts
as a reference or baseline for both of the other groups. Suppose, for example,
that the dependent variable was scores on a depression inventory and the
independent variable was Marital Status, coded as 1 = married, 2 = marital
breakdown, and 3 = never married. If we
used the category "never married" as the reference group, we would
create two new dummy-coded variables, Married and Brkdown, one for each of the
other two categories. The dummy coding is
shown in the table below:
| Marital Status | Married | Brkdown |
|---|---|---|
| 1 = married | 1 | 0 |
| 2 = marital breakdown | 0 | 1 |
| 3 = never married | 0 | 0 |
The category "never
married" is the reference or baseline category, and the regression coefficients
for the two other variables represent the difference between each of the other
categories and the reference group.
Both of the dummy-coded variables are compared to the same reference
group. The regression coefficient for
the Married variable represents the difference associated with being married
rather than never having married. The
regression coefficient for the Brkdown variable represents the difference
associated with having a marital breakdown versus never having married.
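The three-group coding can be sketched as follows. The depression scores are invented for illustration, and `dummy_code` and `group_mean` are hypothetical helper names. When the two dummy variables are the only predictors, each regression coefficient equals the difference between that group's mean and the reference group's mean, so the sketch computes those differences directly instead of fitting the regression.

```python
from statistics import mean

def dummy_code(marital_status):
    """Return (Married, Brkdown) for 1 = married, 2 = marital breakdown, 3 = never married."""
    married = 1 if marital_status == 1 else 0
    brkdown = 1 if marital_status == 2 else 0
    return married, brkdown

# Hypothetical (marital status, depression score) pairs.
cases = [(1, 10), (1, 12), (2, 20), (2, 22), (3, 14), (3, 16)]

# Dummy codes for every case; "never married" gets (0, 0).
coded = [dummy_code(status) for status, _ in cases]

def group_mean(code):
    """Mean depression score for one marital status category."""
    return mean(score for status, score in cases if status == code)

reference_mean = group_mean(3)                # never married (baseline)
b_married = group_mean(1) - reference_mean    # married vs. never married
b_brkdown = group_mean(2) - reference_mean    # marital breakdown vs. never married

print(coded)
print(b_married, b_brkdown)
```

Both differences are taken against the same baseline, which is what makes "never married" the reference category for both dummy variables.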
Binary logistic regression
compares one of the categories of the dependent variable to the other category,
which acts as a reference or baseline group.
Similarly, in multinomial logistic regression with three or more groups,
one group acts as a reference or baseline group, and the other groups of the
dependent variable are contrasted with the reference group. One logistic regression equation is computed
for each category other than the reference group. Suppose, for example, that we were studying the relationship of
different demographic variables to a dependent variable of marital status coded
as 1 = married, 2 = marital breakdown, and 3 = never married. Multinomial logistic regression would derive
two logistic regression equations: one comparing membership in the 1 = married
group to the 3 = never married group, and a second comparing membership in the
2 = marital breakdown group to the 3 = never married group.
We can think of multinomial
logistic regression as an extension of binary logistic regression. In multinomial logistic regression, we are
looking at the odds of being in one of several different dependent variable
groups rather than being in the baseline or reference group. By default, SPSS uses the highest-numbered category as the
reference category. If this default
selection is not suitable for the analysis, the dependent variable must be
recoded prior to running the multinomial logistic regression procedure.
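One way to picture the recoding step is to swap the code of the desired reference category with the highest code. The function below is a hypothetical Python illustration of that idea, not an SPSS command, and it assumes that exchanging the two codes is acceptable for the analysis (any value labels would need to be updated to match).

```python
def recode_reference(codes, reference):
    """Swap `reference` with the highest code so it becomes the highest-numbered category."""
    highest = max(codes)
    swap = {reference: highest, highest: reference}
    # Codes not involved in the swap are left unchanged.
    return [swap.get(code, code) for code in codes]

# Marital status coded 1 = married, 2 = marital breakdown, 3 = never married.
marital = [1, 2, 3, 3, 1, 2]

# Make "married" the reference group by giving it the highest code.
recoded = recode_reference(marital, reference=1)
print(recoded)  # [3, 2, 1, 1, 3, 2]
```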
As in binary logistic
regression, we measure the overall fit, or relationship between the independent
variables and the dependent variable, with a Model Chi-square statistic and its test
of significance. The utility of the
model is measured by pseudo-R² measures and classification
accuracy. We can look at the
B coefficients and standard errors for each of the logistic equations for
indications of numerical problems, such as multicollinearity.
Interpreting
the relationships between individual predictors and group membership is
complicated by the fact that there are multiple equations to interpret, similar
to the interpretation problem for the role of individual variables in
discriminant analysis. For each
logistic regression equation, a set of coefficients, Wald statistics and
probability values, and odds ratios are output by SPSS. The odds ratios are specific to the comparison
between each group and the reference group.
In
addition to the Wald tests for the individual coefficients in each pairwise comparison of groups,
SPSS computes a "Likelihood Ratio Test" for the relationship between each independent variable
and the dependent variable. This is a test
of the contribution or effect of each independent variable to the overall
model, and is based on the difference in –2 log-likelihood if the variable were
removed from the final model. If an
independent variable is not important to the overall model, removing it will not produce a
large change in the –2 log-likelihood measure, the chi-square difference will
not be statistically significant, and we conclude that we have no evidence of a
relationship between this independent variable and the dependent variable. If there is a statistically significant
relationship, we can look at the pattern of significance on the individual Wald
statistics to interpret the role of the variable in predicting membership in
dependent variable categories. The SPSS
manual identifies the "Likelihood ratio test" as more effective in
identifying relationships than the Wald statistics for the individual logistic
regression equations.
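The arithmetic behind the likelihood ratio test can be sketched as below. The –2 log-likelihood values are invented for illustration, and the function names are hypothetical. The sketch assumes a three-group problem in which removing one metric independent variable drops one coefficient from each of the two equations, giving 2 degrees of freedom. The p-value uses the closed-form upper tail of the chi-square distribution, which holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(neg2ll_reduced, neg2ll_final, df):
    """Chi-square difference between the model without the variable and the final model."""
    chi_square = neg2ll_reduced - neg2ll_final
    return chi_square, chi2_sf_even_df(chi_square, df)

# Hypothetical -2 log-likelihood values for the reduced and final models.
chi_square, p_value = likelihood_ratio_test(neg2ll_reduced=210.4, neg2ll_final=198.1, df=2)
print(round(chi_square, 2), round(p_value, 4))
```

A small change in –2 log-likelihood yields a small chi-square and a large p-value, which is the "not important to the overall model" case described above.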
We have not spent much time
on the mechanics of classification for either discriminant analysis or binary
logistic regression because SPSS has done all of the calculations needed for
our analysis. However, while
multinomial logistic regression will classify cases, it has no facility for
cross-validation or selecting subsets of cases. When we do the validation analysis, we will have to run all of the
commands ourselves: splitting the sample, selecting subsets, computing the
logistic regression equations, and doing the classification calculations. To understand what we will be doing, we will
look at an overview of the classification process as it would apply to a
three-group problem. Extensions to
problems with a larger number of groups can be derived from this discussion.
In a three-group problem,
two logistic regression equations are obtained. For each case in the sample, we can substitute the values of the
independent variables and obtain the scores on the logistic regression
equations. These logistic regression
scores are the log of the odds of belonging to each group. The first logistic regression score is the log
of the odds of belonging to the first group rather than the third (reference)
group. We will call the first logistic
regression score g1. The second logistic regression score is the
log of the odds of belonging to the second group rather than the third
(reference) group. We will call the
second logistic regression score
g2. The log of the odds of
belonging to the third (reference) group is g3, which is 0 because all of the
coefficients in the equation for the third group are 0 (i.e. 0 + 0 × IV1 + 0 × IV2 + 0 × IV3 + …).
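Written out, the three scores take the following form. The coefficient symbols b with two subscripts (equation number, then variable number) are notation assumed here for illustration; they do not appear in the SPSS output under these names.

```latex
\begin{aligned}
g_1 &= \ln\frac{P(\text{group 1})}{P(\text{group 3})} = b_{10} + b_{11}\,IV_1 + b_{12}\,IV_2 + b_{13}\,IV_3 + \dots \\
g_2 &= \ln\frac{P(\text{group 2})}{P(\text{group 3})} = b_{20} + b_{21}\,IV_1 + b_{22}\,IV_2 + b_{23}\,IV_3 + \dots \\
g_3 &= \ln\frac{P(\text{group 3})}{P(\text{group 3})} = \ln 1 = 0
\end{aligned}
```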
The scores g1, g2, and g3
are estimates of the log of the odds of belonging to each group rather than the reference group. To convert the scores into probabilities of
group membership, we convert each score into its antilog equivalent with the
EXP function and divide it by the sum of the three antilog equivalents. Because g3 is 0, its antilog is 1. To estimate group membership, we compare the
three probabilities, and predict that the subject is a member of the group that
has the highest probability.
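The classification steps for a single case can be sketched as follows. The scores passed to the function are invented for illustration, and `classify` is a hypothetical name. In practice g1 and g2 would come from substituting the case's values into the two logistic regression equations.

```python
import math

def classify(g1, g2):
    """Convert two logistic regression scores into three probabilities and a predicted group."""
    scores = [g1, g2, 0.0]                     # g3 = 0 for the reference group
    antilogs = [math.exp(g) for g in scores]   # the EXP step; exp(0) = 1 for group 3
    total = sum(antilogs)
    probabilities = [a / total for a in antilogs]
    # Predicted group is the one with the highest probability (groups are numbered 1 to 3).
    predicted = probabilities.index(max(probabilities)) + 1
    return probabilities, predicted

# Hypothetical scores for one case.
probabilities, predicted_group = classify(g1=0.8, g2=-0.3)
print([round(p, 3) for p in probabilities], predicted_group)
```

When both scores are negative, the reference group has the largest antilog (1), so the case is classified into group 3.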
We complete the validation
analysis by having SPSS create a crosstabulated table with actual group
membership in the rows and predicted group membership in the columns. If we request that SPSS include total
percents in the crosstab table, we can sum the total percents on the main
left-to-right diagonal to compute the accuracy rate.
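The accuracy calculation from the crosstabulated table can be sketched as below. The actual and predicted group memberships are invented for illustration, and `accuracy_from_crosstab` is a hypothetical name, not an SPSS procedure.

```python
def accuracy_from_crosstab(actual, predicted, groups=(1, 2, 3)):
    """Build total percents for an actual-by-predicted table and sum the main diagonal."""
    n = len(actual)
    counts = {(a, p): 0 for a in groups for p in groups}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    # Total percent = cell count as a percentage of all cases.
    total_percents = {cell: 100.0 * count / n for cell, count in counts.items()}
    # Diagonal cells are the cases whose predicted group matches the actual group.
    accuracy = sum(total_percents[(g, g)] for g in groups)
    return total_percents, accuracy

# Hypothetical actual and predicted group memberships for eight cases.
actual = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 2, 2, 2, 3, 1, 3, 1]

total_percents, accuracy = accuracy_from_crosstab(actual, predicted)
print(accuracy)  # 75.0
```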
The dataset for this problem is: Voter.Sav
The dataset for this problem is: ActivityPreferenceInventory.Sav.