Two-Group Illustrative Example of Discriminant Analysis

Overview of Discriminant Analysis

 

There are many occasions when the dependent variable in our analysis is a categorical variable, such as type of client, problem, treatment, organization, diagnostic category, or outcome group. If the dependent variable has two categories, we have a choice of logistic regression or discriminant analysis. If it has more than two categories, the appropriate analytic technique until recently has been discriminant analysis.  An alternative to discriminant analysis with three or more groups is multinomial logistic regression, which we will consider in the last class of the semester.

 

There are two generic types of discriminant analysis: descriptive discriminant analysis and predictive discriminant analysis. The goal of descriptive discriminant analysis is to identify the independent variables that have a strong relationship to group membership in the categories of the dependent variable. This component or stage of discriminant analysis may be referred to as deriving the discriminant functions. The goal of predictive discriminant analysis is to use the relationships in the discriminant functions to build a valid and accurate predictive model. This may be referred to as the classification phase or stage of discriminant analysis.

 

While we will rely on the classification results to assess the overall fit of the model, our use of discriminant analysis will be for descriptive studies.  The text presents a substantial amount of material on reducing classification errors, a topic on which we will not spend any time.

 

As in multiple regression and logistic regression, the relationship between the dependent variable and the independent variables is expressed in an equation or a function. Unlike regression analysis, discriminant analysis may produce multiple functions, which together distinguish among the categories of the dependent variable. Each discriminant function produces a discriminant score for every case. The pattern of a case's discriminant scores is used to estimate which group of the dependent variable the case belongs to.

 

Generally, the number of discriminant functions is one less than the number of groups in the dependent variable, unless there are fewer independent variables than groups, in which case, the number of functions is bounded by the number of independent variables. Some, all, or none of the discriminant functions may be statistically significant in a particular problem, depending on how well we can distinguish among the groups.
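
The rule above can be written as a one-line calculation. The following is a minimal sketch; the function name is hypothetical and is not part of SPSS or the text.

```python
def n_discriminant_functions(n_groups: int, n_predictors: int) -> int:
    """Maximum number of discriminant functions: min(groups - 1, predictors)."""
    if n_groups < 2 or n_predictors < 1:
        raise ValueError("need at least two groups and one predictor")
    return min(n_groups - 1, n_predictors)


print(n_discriminant_functions(2, 4))  # two groups, four predictors -> 1
print(n_discriminant_functions(4, 2))  # four groups, two predictors -> 2
```

In a two-group problem there is therefore always exactly one discriminant function, no matter how many independent variables are included.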

 

In multiple regression, the function was derived to satisfy the mathematical property of minimizing the residual variance in the dependent variable. In discriminant analysis, the functions are derived to maximize the between-groups variance relative to the within-groups variance. Another way to think of this is that discriminant analysis maximizes the statistical distance between the means, or centroids (a centroid is the set of means of several variables), of the groups on the set of independent variables. Maximizing this distance between group means should enhance our ability to estimate which group a case belongs to, because the distinction between groups is more clearly defined.
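
To make the between-groups versus within-groups criterion concrete, here is a small sketch for the two-group case with two independent variables. It assumes the standard two-group result that the optimal weights are proportional to the inverse of the within-groups matrix times the difference between the group means. The data values and function names are invented for illustration only.

```python
from statistics import mean


def column_means(rows):
    """Mean of each independent variable within one group."""
    return [mean(col) for col in zip(*rows)]


def within_sscp(groups):
    """Pooled within-groups sums of squares and cross-products (2x2)."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = column_means(rows)
        for x in rows:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    w[i][j] += d[i] * d[j]
    return w


def fisher_weights(group_a, group_b):
    """Weights proportional to W^-1 (mean_a - mean_b), two predictors only."""
    w = within_sscp([group_a, group_b])
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    ma, mb = column_means(group_a), column_means(group_b)
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    # Explicit 2x2 inverse applied to the mean-difference vector.
    return [(w[1][1] * d[0] - w[0][1] * d[1]) / det,
            (-w[1][0] * d[0] + w[0][0] * d[1]) / det]


def variance_ratio(weights, group_a, group_b):
    """Between-groups / within-groups sum of squares of the weighted scores."""
    sa = [sum(wi * xi for wi, xi in zip(weights, x)) for x in group_a]
    sb = [sum(wi * xi for wi, xi in zip(weights, x)) for x in group_b]
    mean_a, mean_b, grand = mean(sa), mean(sb), mean(sa + sb)
    between = len(sa) * (mean_a - grand) ** 2 + len(sb) * (mean_b - grand) ** 2
    within = sum((s - mean_a) ** 2 for s in sa) + sum((s - mean_b) ** 2 for s in sb)
    return between / within


# Invented data: two groups measured on two independent variables.
group_a = [(2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]
group_b = [(4.0, 2.0), (5.0, 4.0), (6.0, 3.0), (7.0, 5.0)]

weights = fisher_weights(group_a, group_b)
ratio_combined = variance_ratio(weights, group_a, group_b)
ratio_x1_only = variance_ratio([1.0, 0.0], group_a, group_b)
ratio_x2_only = variance_ratio([0.0, 1.0], group_a, group_b)
print(weights, ratio_combined, ratio_x1_only, ratio_x2_only)
```

With these invented numbers, the weighted combination separates the groups far better than either variable used alone, which is the sense in which the discriminant function "maximizes" the ratio.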

 

The process of translating the independent variables from the original coordinate system to the coordinate system of discriminant space uses the mathematical procedure of finding characteristic roots, or eigenvalues. The matrix that is used to multiply the original scores to convert them to discriminant scores is made up of the corresponding eigenvectors. It is not absolutely necessary that we understand the mathematical process for deriving this translation in coordinate systems to make use of it. What is necessary to remember is that this mathematical process translates the coordinate dimensions of our original problem (one coordinate or axis for each independent variable) to the reduced dimensionality of discriminant space, in which the largest eigenvalue is associated with the first dimension of the reduced space, the second largest eigenvalue is associated with the second dimension of the reduced space, and so on. The translation from one coordinate system to the other is mathematically exact, and changes the form of the information, but not the content.
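
For readers who want to see the eigenvalue problem written out, the usual textbook formulation is sketched below. The notation is an assumption and does not come from the course text: B and W stand for the between-groups and within-groups sums of squares and cross-products matrices, v_k for the k-th eigenvector, lambda_k for its eigenvalue, x for a case's vector of independent variables, and z_k for its score on the k-th discriminant function.

```latex
\lambda = \max_{\mathbf{v}}
  \frac{\mathbf{v}^{\top}\mathbf{B}\,\mathbf{v}}{\mathbf{v}^{\top}\mathbf{W}\,\mathbf{v}}
\quad\Longrightarrow\quad
\mathbf{W}^{-1}\mathbf{B}\,\mathbf{v}_k = \lambda_k\,\mathbf{v}_k ,
\qquad
\lambda_1 \ge \lambda_2 \ge \cdots ,
\qquad
z_k = \mathbf{v}_k^{\top}\mathbf{x}
```

Each eigenvalue is the ratio of between-groups to within-groups variation along its own dimension, which is why the first dimension carries the most discriminating information.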

 

The classification phase of discriminant analysis is analogous to the process of comparing individual cases to a group mean using standard scores or z-scores. For each group in the dependent variable, we calculate the mean and variance in discriminant space. We then convert the independent variables for an individual case to discriminant space. We compute the statistical distance that the case is from the group mean in standard score units. We guess or predict that the case is a member of the group that corresponds to the smallest distance between group mean and individual scores.
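
The classification steps just described can be sketched for a single discriminant function as follows. The discriminant scores and the function name are invented for illustration. SPSS's own classification also takes prior probabilities into account, so treat this as a simplified picture of the idea rather than a reproduction of the SPSS output.

```python
from statistics import mean, stdev


def classify(score, group_scores):
    """Return the group whose mean is closest to `score` in standard-score units.

    `group_scores` maps each group name to the discriminant scores of its
    known members. Also returns the absolute z-distance to every group.
    """
    distances = {}
    for name, scores in group_scores.items():
        z = (score - mean(scores)) / stdev(scores)
        distances[name] = abs(z)
    predicted = min(distances, key=distances.get)
    return predicted, distances


# Invented discriminant scores for cases with known group membership.
known = {
    "group 1": [-1.8, -1.1, -0.9, -0.2],
    "group 2": [0.4, 0.9, 1.3, 1.6],
}

for new_score in (-1.2, 0.1, 1.0):
    predicted, distances = classify(new_score, known)
    print(new_score, predicted, distances)
```

A case with a score near the middle, such as 0.1, is the interesting one: it is assigned to whichever group mean it is closer to after each distance is scaled by that group's own spread.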

 

We can compare the predicted group memberships to known group memberships to derive an accuracy measure, or "hit" rate, for a discriminant model. The accuracy rate for a model is notoriously inflated by overfitting when the same cases are used both to derive the functions and to evaluate them, making holdout testing a necessity.  SPSS provides us with a one-at-a-time holdout method. This method sequentially holds out one case from the analysis and uses the remaining cases to derive the discriminant functions, which are then used to classify the held-out case. The procedure is repeated for every case in the analysis, and the resulting accuracy rate is usually regarded as a less biased measure of model accuracy. To this calculation, we will add our usual split-half validation analysis.
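
The difference between the ordinary hit rate and the one-at-a-time holdout hit rate can be illustrated with a short sketch. To keep it self-contained, a simple nearest-group-mean rule on a single score stands in for the discriminant model. That rule, the data, and all function names are assumptions made for illustration and are not the SPSS procedure itself.

```python
from statistics import mean


def nearest_mean_classifier(train):
    """Stand-in model: assign a score to the group with the closest mean."""
    by_group = {}
    for score, group in train:
        by_group.setdefault(group, []).append(score)
    means = {g: mean(v) for g, v in by_group.items()}
    return lambda score: min(means, key=lambda g: abs(score - means[g]))


def resubstitution_hit_rate(cases):
    """Hit rate when the same cases build the model and are classified by it."""
    predict = nearest_mean_classifier(cases)
    return sum(predict(s) == g for s, g in cases) / len(cases)


def leave_one_out_hit_rate(cases):
    """Hit rate when each case is classified by a model built without it."""
    hits = 0
    for i, (score, group) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]  # every case except case i
        predict = nearest_mean_classifier(training)
        hits += predict(score) == group
    return hits / len(cases)


# Invented scores with known group membership.
cases = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (4.9, "A"),
         (5.3, "B"), (7.0, "B"), (8.0, "B"), (9.0, "B")]

resub = resubstitution_hit_rate(cases)
loo = leave_one_out_hit_rate(cases)
print(resub, loo)
```

In this invented data set, the two borderline cases are classified correctly only when they help define their own group mean, so the holdout hit rate comes out lower than the ordinary one.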

 

A more detailed presentation of all of the statistics and processes incorporated in discriminant analysis is given in the text.  I will follow the text in working the two-group and three-group sample problems, in that we will extract a holdout sample at the very start of the analysis and do the discriminant analysis on the cases that were not in the holdout sample.  However, when we work problems thereafter, we will conduct the discriminant analysis on the entire sample, and do the holdout analysis in stage 6, when we address the issue of validation.