Class 15 Worksheet


Readings (on reserve in LRC)

David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, Chapter 8, Special Topics, Section 8.1, Polytomous Logistic Regression, pages 216 – 238. (Polytomous logistic regression is an older name for multinomial logistic regression.)

Chapter 9, Multinomial Logistic Regression Examples, in SPSS Regression Models 10.0, pages 65-82.


Multinomial Logistic Regression

Binary logistic regression is an effective tool for analyzing relationships when the dependent variable has two categories.  When the dependent variable has more than two categories, we can analyze the relationship with multinomial logistic regression.

 

In binary logistic regression, the analysis focuses on finding which independent variables increase the likelihood that a subject is a member of a particular group specified by the dependent variable, rather than a member of the other group specified by the dependent variable. The "other group" functions as a reference group, and acts as a baseline against which we compare membership in the particular group we are interested in.

 

In effect, we use the dependent variable in a logistic regression like a dummy-coded independent variable in a multiple regression.  We dummy-code a two-value categorical variable in multiple regression by assigning one category a code of 1 and the other category a code of 0.  The regression coefficient for this independent variable in a multiple regression equation specifies the change in the dependent variable associated with a change from the 0 category to the 1 category.  For example, suppose the dependent variable was salary and the independent variable was sex, coded so that 1 represented males and 0 represented females.  A regression coefficient of $800 would mean that males in our sample averaged $800 more in salary than their female counterparts.  The group coded 0 represents the reference or baseline group, and the regression coefficient represents the difference in scores on the dependent variable for the group coded 1.
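The claim that the coefficient for a 0/1 dummy variable equals the difference between the two group means can be checked with a small sketch. The six salaries below are invented purely for illustration (they are not from any class dataset), and the slope is computed with the ordinary least-squares formula for a single predictor.

```python
from statistics import mean

# Hypothetical data, invented for illustration only.
sex = [0, 0, 0, 1, 1, 1]            # 1 = male, 0 = female (reference group)
salary = [30000, 31000, 32000, 30800, 31800, 32800]

# Least-squares slope and intercept for one predictor.
mean_x, mean_y = mean(sex), mean(salary)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sex, salary))
         / sum((x - mean_x) ** 2 for x in sex))
intercept = mean_y - slope * mean_x

# Difference between the group means, computed directly.
male_mean = mean(y for x, y in zip(sex, salary) if x == 1)
female_mean = mean(y for x, y in zip(sex, salary) if x == 0)

print("regression coefficient for sex:", slope)
print("difference in group means:", male_mean - female_mean)
print("intercept (mean of reference group):", intercept)
```

With these made-up numbers the coefficient and the difference in means agree, and the intercept is the mean salary of the group coded 0.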

 

If we extend the example of dummy coding to a three-group independent variable, the regression coefficients represent the differences between two of the groups and the third group, which acts as a reference or baseline for both of the other groups. Suppose, for example, that the dependent variable was scores on a depression inventory and the independent variable was Marital Status coded as 1 = married, 2 = marital breakdown, and 3 = never married.  If we used the category "never married" as the reference group, we would dummy-code the other two categories into two new variables, Married and Brkdown.  The dummy coding is shown in the table below:

 

Marital Status            Married   Brkdown

1 = married                  1         0

2 = marital breakdown        0         1

3 = never married            0         0

 

The category "never married" is the reference or baseline category, and the regression coefficients for the two other variables represent the difference between each of the other categories and the reference group.  Both of the dummy-coded variables are compared to the same reference group.  The regression coefficient for the Married variable represents the difference associated with being married rather than never having married.  The regression coefficient for the Brkdown variable represents the difference associated with having a marital breakdown versus never having married.
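The dummy coding described here can be written out as a short sketch. The function name dummy_code is hypothetical, chosen only for this illustration; the codes and the variable names Married and Brkdown follow the worksheet.

```python
def dummy_code(marital_status):
    """Return (Married, Brkdown) for a Marital Status code.

    Codes: 1 = married, 2 = marital breakdown, 3 = never married.
    Code 3 is the reference category, so it is 0 on both dummy variables.
    """
    if marital_status not in (1, 2, 3):
        raise ValueError("Marital Status must be coded 1, 2, or 3")
    married = 1 if marital_status == 1 else 0
    brkdown = 1 if marital_status == 2 else 0
    return married, brkdown


for code in (1, 2, 3):
    married, brkdown = dummy_code(code)
    print(code, married, brkdown)
```

Because the reference category is 0 on both new variables, each coefficient is read as a contrast with "never married" and not with the other non-reference category.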

 

Binary logistic regression compares one of the categories of the dependent variable to the other category, which acts as a reference or baseline group.  Similarly, in multinomial logistic regression with three or more groups, one group acts as a reference or baseline group, and the other groups of the dependent variable are contrasted to the reference group.  One logistic regression equation is computed for each category other than the reference group.  Suppose, for example, that we were studying the relationship of different demographic variables to a dependent variable of marital status coded as 1 = married, 2 = marital breakdown, and 3 = never married.  Multinomial logistic regression would derive two logistic regression equations: one comparing membership in the 1 = married group to the 3 = never married group, and a second comparing membership in the 2 = marital breakdown group to the 3 = never married group.

 

We can think of multinomial logistic regression as an extension of binary logistic regression.  In multinomial logistic regression, we are looking at the odds of being in one of several different dependent variable groups rather than being in the baseline or reference group.  SPSS uses the highest-numbered category as the reference category.  If this default selection is not suitable for the analysis, the dependent variable must be recoded prior to running the multinomial logistic regression procedure.
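One way to picture the recoding step is to swap the code of the category we want as the reference with the highest code, so that the desired category ends up with the highest number. The sketch below shows only the logic; the function name recode_reference is hypothetical, and in practice the recode would be done in SPSS, with the value labels updated to match the new codes.

```python
def recode_reference(values, new_reference):
    """Swap new_reference with the highest code so it becomes the highest."""
    highest = max(values)
    swap = {new_reference: highest, highest: new_reference}
    return [swap.get(v, v) for v in values]


# Hypothetical marital status codes: 1 = married, 2 = marital breakdown,
# 3 = never married.  Make "married" the reference category.
marital_status = [1, 2, 3, 1, 3]
print(recode_reference(marital_status, new_reference=1))
```

After this swap, code 3 means "married" and code 1 means "never married", so any interpretation of the output has to use the new labels.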

 

As in binary logistic regression, we measure the overall fit or relationship between the independent variables and the dependent variable with a Model Chi-square statistic and its test of significance.  The utility of the model is measured by pseudo-R² measures and classification accuracy.  We can look at the B-coefficients and standard errors for each of the logistic equations for indications of numerical problems, such as multicollinearity.

 

Interpreting the relationships between individual predictors and group membership is complicated by the fact that there are multiple equations to interpret, similar to the interpretation problem for the role of individual variables in discriminant analysis.  For each logistic regression equation, a set of coefficients, Wald statistics and probability values, and odds ratios are output by SPSS.  The odds ratios are specific to the comparison between each group and the reference group.

 

In addition to the Wald tests for individual coefficients in each pair of groups, SPSS computes a "Likelihood Ratio Test" for each independent variable.  This is a test of the contribution or effect of each independent variable to the overall model, and is based on the change in –2 log-likelihood that would result if the variable were removed from the final model.  If an independent variable is not important to the overall model, removing it will not produce a large change in the –2 log-likelihood measure, the chi-square difference will not be statistically significant, and we have no evidence of a relationship between this independent variable and the dependent variable.  If there is a statistically significant relationship, we can look at the pattern of significance on the individual Wald statistics to interpret the role of the variable in predicting membership in dependent variable categories.  The SPSS manual identifies the "Likelihood ratio test" as more effective in identifying relationships than the Wald statistics for the individual logistic regression equations.
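The arithmetic behind the likelihood ratio test can be sketched as follows. The two –2 log-likelihood values are hypothetical, not SPSS output. The sketch assumes a three-group dependent variable and a predictor that contributes one coefficient to each of the two equations, so the test has 2 degrees of freedom; for exactly 2 degrees of freedom the chi-square tail probability is exp(-x/2), which lets the example avoid any statistics library.

```python
import math


def likelihood_ratio_test_2df(neg2ll_reduced, neg2ll_full):
    """Chi-square difference test with 2 degrees of freedom.

    neg2ll_reduced: -2 log-likelihood with the variable removed
    neg2ll_full:    -2 log-likelihood of the final model
    """
    chi_square = max(0.0, neg2ll_reduced - neg2ll_full)
    # Exact upper-tail probability of a chi-square with df = 2 only.
    p_value = math.exp(-chi_square / 2.0)
    return chi_square, p_value


# Hypothetical -2 log-likelihood values, invented for illustration.
chi_square, p_value = likelihood_ratio_test_2df(210.5, 198.3)
print("chi-square difference:", round(chi_square, 2))
print("p-value:", round(p_value, 4))
```

For a predictor with more than one coefficient per equation, or a dependent variable with more than three groups, the degrees of freedom are larger and the shortcut formula no longer applies.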

 

Classification

 

We have not spent much time on the mechanics of classification for either discriminant analysis or binary logistic regression because SPSS has done all of the calculations needed for our analysis.  However, while multinomial logistic regression will classify cases, it has no facility for cross-validation or selecting subsets of cases.  When we do the validation analysis, we will have to issue all of the commands ourselves: splitting the sample, selecting subsets, computing the logistic regression equations, and carrying out the classification calculations.  To understand what we will be doing, we will look at an overview of the classification process as it would apply to a three-group problem.  Extensions to problems with a larger number of groups can be derived from this discussion.

 

 

In a three-group problem, two logistic regression equations are obtained.  For each case in the sample, we can substitute the values of the independent variables and obtain the scores on the logistic regression equations.  These logistic regression scores are the log of the odds of belonging to each group rather than the reference group.  The first logistic regression score is the log of the odds of belonging to the first group rather than the third (reference) group.  We will call the first logistic regression score g1.  The second logistic regression score is the log of the odds of belonging to the second group rather than the third (reference) group.  We will call the second logistic regression score g2.  The score for the third (reference) group is g3, which is 0 because all of the coefficients in the equation for the third group are 0 (i.e. 0 + 0 × IV1 + 0 × IV2 + 0 × IV3 + …).

 

The scores g1, g2, and g3 are estimates of the log of the odds of belonging to each group relative to the reference group.  To convert the scores into probabilities of group membership, we convert each score into its antilog equivalent with the EXP function and divide by the sum of the three antilog equivalents.  To estimate group membership, we compare the three probabilities, and predict that the subject is a member of the group that has the highest probability.
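The conversion from scores to probabilities can be sketched in a few lines. The function names and the example scores are hypothetical; the calculation itself is the one described above, with g3 fixed at 0 for the reference group.

```python
import math


def group_probabilities(g1, g2):
    """Convert the two logistic regression scores into three probabilities."""
    scores = [g1, g2, 0.0]                    # g3 = 0 for the reference group
    antilogs = [math.exp(g) for g in scores]  # EXP of each score
    total = sum(antilogs)
    return [a / total for a in antilogs]


def predicted_group(g1, g2):
    """Return 1, 2, or 3: the group with the highest probability."""
    probabilities = group_probabilities(g1, g2)
    return probabilities.index(max(probabilities)) + 1


# Hypothetical scores for one case.
probabilities = group_probabilities(1.2, -0.4)
print([round(p, 3) for p in probabilities])
print("predicted group:", predicted_group(1.2, -0.4))
```

The three probabilities always sum to 1, and a case with two negative scores is assigned to the reference group because exp(0) = 1 is then the largest of the three antilogs.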

 

We complete the validation analysis by having SPSS create a crosstabulated table with actual group membership in the rows and predicted group membership in the columns.  If we request that SPSS include total percents in the crosstab table, we can sum the total percents on the main left-to-right diagonal to compute the accuracy rate.
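The accuracy calculation from the crosstabulated table can be sketched as below. The actual and predicted group codes are invented for illustration, and the function name accuracy_from_crosstab is hypothetical; the logic mirrors summing the total percents on the main diagonal.

```python
from collections import Counter


def accuracy_from_crosstab(actual, predicted):
    """Return (total percents per cell, accuracy rate in percent)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    # Count cases in each (actual row, predicted column) cell.
    cells = Counter(zip(actual, predicted))
    total_percents = {cell: 100.0 * count / n for cell, count in cells.items()}
    # Diagonal cells are those where actual and predicted group agree.
    accuracy = sum(pct for (row, col), pct in total_percents.items()
                   if row == col)
    return total_percents, accuracy


# Hypothetical group memberships for eight cases.
actual = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 1, 2, 3, 3, 3, 1, 2]
total_percents, accuracy = accuracy_from_crosstab(actual, predicted)
print("accuracy rate (%):", accuracy)
```

Here five of the eight made-up cases fall on the diagonal, so the accuracy rate is 62.5 percent.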

 


Exercise 1:  SPSS Sample Problem for Multinomial Logistic Regression

The dataset for this problem is: Voter.Sav

Exercise 2:  The Personnel Classification Problem in Multinomial Logistic Regression

The dataset for this problem is: ActivityPreferenceInventory.Sav.