# Tag: source code

The other day I was looking for a package that did the Quadrant Count Ratio (QCR) in R. I couldn't find one, so I whipped up some simple code to do what I needed to do.

qcr <- function(dat){
n <- nrow(dat);
m.x <- mean(dat[,1]); m.y <- mean(dat[,2]);
# in QCR we ignore points that are on the mean lines
# number of points in Quadrants 1 and 3
q13 <- sum(dat[,1] > mean(dat[,1]) & dat[,2] > mean(dat[,2]))+sum(dat[,1] < mean(dat[,1]) & dat[,2] < mean(dat[,2]))
# number of points in Quadrants 2 and 4
q24 <- sum(dat[,1] < mean(dat[,1]) & dat[,2] > mean(dat[,2]))+sum(dat[,1] < mean(dat[,1]) & dat[,2] > mean(dat[,2]))
return((q13-q24)/n)
}


The above assumes dat is an Nx2 array with column 1 serving as X and column 2 serving as Y. This can easily be changed. I also wrote a little function to plot the mean lines:

plot.qcr <- function(dat){
value <- qcr(dat);
plot(dat);
abline(v=mean(dat[,1]),col="blue"); # adds a line for x mean
abline(h=mean(dat[,2]),col="red"); # adds a line for y mean
}


Both of these functions are simple, but I will likely extend and polish them (and then release them as a package). I'd also like to explore what would happen to the QCR if median lines were used instead of mean lines. (This new QCR* would no longer directly motivate Pearson's Product-Moment Correlation, but could have its own set of advantages.) Below is a quick example:

# QCR example
set.seed(1)
dat.x <- c(1:10)
dat.y <- rbinom(10,10,.5)
dat <- cbind(dat.x,dat.y)
qcr(dat)
# [1] 0.6
plot.qcr(dat)


This is the plot:

With Linear Discriminant Analysis (LDA), in general the identification of the classification criterion is not trivial. The classification regions can be disjoint, there can be several regions into which observations are classified, the discrimination can be based upon a multivariate normal distribution, etc. However, when classifying with two univariate normal populations, the identification of the discriminant function is easy. When the inequality $(\alpha_1+\beta_1 x \gt \alpha_2 + \beta_2 x)$ is true, we classify the observation, (x), as belonging to population 1. The discrimination criterion is to classify an observation, (x), as belonging to population 1 if

While this is simple, I kept misplacing the piece of paper that I kept it written on and kept re-implementing it in R. This post is mostly a reminder to myself. Here is some R code to do the above:

d <- function(alpha.2,alpha.1,beta.2,beta.1){
return((alpha.2-alpha.1)/(beta.1-beta.2));
}


The reason this is necessary is because SAS doesn't report the discrimination function even when it is possible to report succinctly. SAS does report the (alpha_i) and (beta_i) values, though.

As an example, this is the output from SAS after running PROC DISCRIM on some data with a binary response variable and the POOL=TEST option.

We can use the above to find that the discriminant is (d=frac{-3.96320-(-7.86740)}{1.12389-0.79768}=11.96836). We can use PROC SGPLOT to display this discriminant function:

data work.blog_flat_cross;
set work.blog_flat_cross;
ID=_N_;
Classified = 'A: ' || response || ', P: ' || _INTO_;
run;

/* Displaying the scatterplot without the discriminant line */
PROC SGSCATTER DATA=work.blog_flat_cross;
PLOT xvar*ID / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
RUN;

/* Displaying the scatterplot with the discriminant line */
PROC SGPLOT DATA=work.blog_flat_cross;
SCATTER X=ID Y=xvar / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
REFLINE 11.96836394 /axis=y lineattrs=(color=black pattern=1); /* value found by hand using output */
RUN;


A plot of the results from PROC DISCRIM with the discriminant line added.

Note that adding a REFLINE requires PROC SGPLOT instead of PROC SGSCATTER. Also, this post was edited because the response "YES" was originally just "YE". The text response was shorter than it should because SAS makes the maximum length of a variable whatever the length of the first value it encounters is. Any values longer than the first value are truncated (see the SAS documentation). To fix this, when declaring the variable in a data step one should manually set the length: length response \$ 3; where 3 is the maximum length we want.

A good resource for LDA is:

Johnson, R. A. and D. W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th ed.). Upper Saddle River, NJ, USA: Pearson Prentice Hall.

I often use R to do analyses, and I've found that presentation of output can be... less than ideal. When I was in a hurry, I would do something like this:

> print("test.var=")
[1] "test.var="
> print(round(test.var,3))
[1] 11.797 11.247 12.582  7.575  9.427


This isn't pretty, but it is quick. Not satisfied with quick and dirty, I wrote this function:

pprint <- function(..., sep="", innersep=" ", outersep1="'", outersep2="'"){
toPrint <- c();
for (ii in 1:length(list(...))){
# print a normal object like text (e.g. "testing ")
if (length(list(...)[[ii]]) == 1){
toPrint <- paste(toPrint, list(...)[[ii]], sep=sep);
}
# printing a vector (e.g. x <- c(1:10))
else if (length(list(...)[[ii]]) > 1) {
tempString <- paste(outersep1, list(...)[[ii]][1], sep="");
for (jj in 2:length(list(...)[[ii]]))
tempString <- paste(tempString, list(...)[[ii]][jj], sep=innersep);
tempString <- paste(tempString, outersep2, sep="");
toPrint <- paste(toPrint, tempString, sep="");
}
}
print(toPrint);
}


pprint is basically a function which combines print and paste but with some options that allow for easier presentation. For example:

> pprint("test.var=",round(test.var,3))
[1] "test.var='11.797 11.247 12.582 7.575 9.427'"
> pprint("test.var=",round(test.var,3),innersep=", ")
[1] "test.var='11.797, 11.247, 12.582, 7.575, 9.427'"


I use this usually when I'm formatting output to turn in for a homework assignment. If you have any questions/comments/improvements, let me know.