With Linear Discriminant Analysis (LDA), identifying the classification criterion is in general not trivial. The classification regions can be disjoint, there can be several regions into which observations are classified, the discrimination can be based upon a multivariate normal distribution, etc. However, when classifying between two univariate normal populations, the discriminant function is easy to identify. The discrimination criterion is to classify an observation, \(x\), as belonging to population 1 if

\[ x > \frac{\alpha_2 - \alpha_1}{\beta_1 - \beta_2}, \]

where \(\alpha_i\) and \(\beta_i\) are the constant and the coefficient of the linear discriminant function for population \(i\), and \(\beta_1 > \beta_2\) is assumed (the inequality reverses if \(\beta_1 < \beta_2\)). Otherwise, \(x\) is classified as belonging to population 2.
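
As a short sketch of where this criterion comes from, assume (this is not spelled out in the post) that each population \(i\) has a linear discriminant score \(\alpha_i + \beta_i x\) and that an observation is assigned to the population with the larger score. Taking \(\beta_1 > \beta_2\), population 1 wins when:

\[
\begin{aligned}
\alpha_1 + \beta_1 x &> \alpha_2 + \beta_2 x \\
(\beta_1 - \beta_2)\,x &> \alpha_2 - \alpha_1 \\
x &> \frac{\alpha_2 - \alpha_1}{\beta_1 - \beta_2}
\end{aligned}
\]

Under the further assumptions of a pooled variance \(\sigma^2\) and equal priors, \(\alpha_i = -\mu_i^2/(2\sigma^2)\) and \(\beta_i = \mu_i/\sigma^2\), so the cutoff reduces to the midpoint \((\mu_1 + \mu_2)/2\) of the two population means.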

While this is simple, I kept misplacing the piece of paper I had written it on and kept re-implementing it in R. This post is mostly a reminder to myself. Here is some R code to do the above:

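# Cutoff d at which the two linear discriminant scores are equal:
# alpha.1 + beta.1 * d == alpha.2 + beta.2 * d
d <- function(alpha.2, alpha.1, beta.2, beta.1) {
  (alpha.2 - alpha.1) / (beta.1 - beta.2)
}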

This is necessary because SAS doesn't report the discrimination function even when it can be stated succinctly. SAS does report the \(\alpha_i\) and \(\beta_i\) values, though.
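
A minimal sketch of how those reported values might be plugged in, assuming SAS lists one constant (\(\alpha_i\)) and one coefficient (\(\beta_i\)) per population. The function `d` is repeated so the block runs on its own. `classify` is a hypothetical helper, not part of the original code, that compares the two linear scores directly. The numbers are the constants and coefficients from the worked example in this post.

```r
# Cutoff function from the post, repeated so this block is self-contained.
d <- function(alpha.2, alpha.1, beta.2, beta.1) {
  (alpha.2 - alpha.1) / (beta.1 - beta.2)
}

# Hypothetical helper (an assumption, not in the original post):
# assign x to the population with the larger score alpha + beta * x.
classify <- function(x, alpha.1, beta.1, alpha.2, beta.2) {
  ifelse(alpha.1 + beta.1 * x > alpha.2 + beta.2 * x, 1L, 2L)
}

# Constants and coefficients from the worked example in this post.
alpha.1 <- -7.86740; beta.1 <- 1.12389
alpha.2 <- -3.96320; beta.2 <- 0.79768

cutoff <- d(alpha.2, alpha.1, beta.2, beta.1)
print(cutoff)  # approximately 11.96836

# One observation below the cutoff and one above it.
labels <- classify(c(10, 13), alpha.1, beta.1, alpha.2, beta.2)
print(labels)  # populations 2 and 1, in that order
```

Comparing the scores directly, rather than comparing \(x\) to the cutoff, avoids having to remember which way the inequality points when \(\beta_1 < \beta_2\).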

As an example, this is the output from SAS after running PROC DISCRIM on some data with a binary response variable and the POOL=TEST option.

We can use the above to find that the discriminant is \(d=\frac{-3.96320-(-7.86740)}{1.12389-0.79768}=11.96836\). We can use PROC SGPLOT to display this discriminant function:

/* Add an observation index and an actual (A) / predicted (P) label for plotting */
data work.blog_flat_cross;
    set work.blog_flat_cross;
    ID = _N_;
    Classified = 'A: ' || response || ', P: ' || _INTO_;
run;

/* Displaying the scatterplot without the discriminant line */
PROC SGSCATTER DATA=work.blog_flat_cross;
    PLOT xvar*ID / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
RUN;

/* Displaying the scatterplot with the discriminant line */
PROC SGPLOT DATA=work.blog_flat_cross;
    SCATTER X=ID Y=xvar / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
    REFLINE 11.96836394 / axis=y lineattrs=(color=black pattern=1); /* value found by hand using output */
RUN;

A plot of the results from PROC DISCRIM with the discriminant line added.

Note that adding a REFLINE requires PROC SGPLOT instead of PROC SGSCATTER. Also, this post was edited because the response "YES" was originally just "YE". The text response was shorter than it should have been because SAS sets the length of a character variable from the first value it encounters. Any longer values are truncated (see the SAS documentation). To fix this, set the length manually when declaring the variable in a data step: length response $ 3; where 3 is the maximum length we want.

A good resource for LDA is:

Johnson, R. A. and D. W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th ed.). Upper Saddle River, NJ, USA: Pearson Prentice Hall.