Month: September 2012

While looking for more probability and statistics comics, I went to one of my favorite webcomics: Dinosaur Comics. A handy resource for finding individual Dinosaur Comics is this XML file, which (I think) contains the text of every comic. Ctrl+F is the way to go for finding comics of interest. While numerous comics mention statistics, the relationship is usually tangential at best.

you can show this comic to chronic gamblers and they will PUNCH YOU IN THE NECK
"gambling makes you appear more attractive in the eyes of women" - Copyright 2004 by Ryan North, Dinosaur Comics
keywords: probability; Gambler's Fallacy; dice; luck;

On an unrelated note, Dinosaur Comics has one of the best alternate URLs in existence: chewbac.ca. Food for thought.

It amazes me how many people are amazed when I show them this little trick. Holding the "Alt" key while clicking and dragging selects a rectangular region in Word. With a monospace (fixed-width) font such as Consolas or Courier New, this can be used to quickly delete columns of text. It works best with monospace fonts, but it has helped me many times in other situations too. An example of a rectangular region selected with this technique is below.

Sometimes I will launch Word just to use this tool and go back to editing files in another program. I learned this back in high school in what I thought would be a gimmicky course on Microsoft Office. In that course I learned some pretty useful skills that, while not making me an expert, have served me well. It amazes me how many people have taken courses centered around Microsoft Office and still seem perplexed by it.

With Linear Discriminant Analysis (LDA), identifying the classification criterion is in general not trivial. The classification regions can be disjoint, there can be several regions into which observations are classified, the discrimination can be based upon a multivariate normal distribution, etc. However, when classifying with two univariate normal populations, the discriminant function is easy to identify. We classify the observation $x$ as belonging to population 1 when the inequality $\alpha_1+\beta_1 x \gt \alpha_2 + \beta_2 x$ is true. Solving for $x$ (and assuming $\beta_1 \gt \beta_2$; the inequality reverses otherwise), the discrimination criterion is to classify $x$ as belonging to population 1 if

$$x \gt \frac{\alpha_2-\alpha_1}{\beta_1-\beta_2}$$

While this is simple, I kept misplacing the piece of paper that I kept it written on and kept re-implementing it in R. This post is mostly a reminder to myself. Here is some R code to do the above:

# Boundary between the two populations: the x at which both linear scores are equal
d <- function(alpha.2, alpha.1, beta.2, beta.1) {
  (alpha.2 - alpha.1) / (beta.1 - beta.2)
}
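As a rough usage sketch, the same formula can be checked against the coefficients from the PROC DISCRIM example later in this post. The names `boundary` and `classify` are hypothetical helpers introduced only for illustration. Comparing the two linear scores directly is one way to avoid tracking which way the inequality points when $\beta_1 \lt \beta_2$.

```r
# Hypothetical helper: same formula as d(), with the arguments grouped by population.
boundary <- function(alpha.1, beta.1, alpha.2, beta.2) {
  (alpha.2 - alpha.1) / (beta.1 - beta.2)
}

# Hypothetical helper: classify by comparing the two linear scores directly,
# so the direction of the inequality never has to be worked out by hand.
classify <- function(x, alpha.1, beta.1, alpha.2, beta.2) {
  ifelse(alpha.1 + beta.1 * x > alpha.2 + beta.2 * x, 1L, 2L)
}

# Coefficients taken from the SAS example later in this post.
b <- boundary(alpha.1 = -7.86740, beta.1 = 1.12389,
              alpha.2 = -3.96320, beta.2 = 0.79768)
print(b)  # approximately 11.968

# Observations below the boundary go to population 2, above it to population 1.
groups <- classify(c(10, 12, 14), -7.86740, 1.12389, -3.96320, 0.79768)
print(groups)
```

Here 10 falls below the boundary and is assigned to population 2, while 12 and 14 fall above it and are assigned to population 1.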

This is necessary because SAS doesn't report the discriminant function even when it could be reported succinctly. SAS does report the $\alpha_i$ and $\beta_i$ values, though.

As an example, this is the output from SAS after running PROC DISCRIM on some data with a binary response variable and the POOL=TEST option.

We can use the above to find that the discriminant is $d=\frac{-3.96320-(-7.86740)}{1.12389-0.79768}=11.96836$. We can use PROC SGPLOT to display this discriminant function:

data work.blog_flat_cross;
set work.blog_flat_cross;
ID=_N_;
Classified = 'A: ' || response || ', P: ' || _INTO_;
run;

/* Displaying the scatterplot without the discriminant line */
PROC SGSCATTER DATA=work.blog_flat_cross;
PLOT xvar*ID / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
RUN;

/* Displaying the scatterplot with the discriminant line */
PROC SGPLOT DATA=work.blog_flat_cross;
SCATTER X=ID Y=xvar / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
REFLINE 11.96836394 /axis=y lineattrs=(color=black pattern=1); /* value found by hand using output */
RUN;

A plot of the results from PROC DISCRIM with the discriminant line added.

Note that adding a REFLINE requires PROC SGPLOT instead of PROC SGSCATTER. Also, this post was edited because the response "YES" originally appeared as just "YE". The text response was shorter than it should be because SAS sets the length of a character variable to the length of the first value it encounters. Any values longer than the first value are truncated (see the SAS documentation). To fix this, manually set the length when declaring the variable in a data step: length response $ 3; where 3 is the maximum length we want.

A good resource for LDA is:

Johnson, R. A. and D. W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th ed.). Upper Saddle River, NJ, USA: Pearson Prentice Hall.

There are tons of examples of statistics being misused or misrepresented by the media and politicians (among others). I'm going to try and collect examples as I find them so that I don't have to search for them at the last minute.

I saw this image floating around the internet (not sure who originally took the screenshot):

This graph is misleading because it exaggerates the increase in the tax rate by not starting the Y-axis at zero. Displaying the graph only from 34% and up makes the tax rate after January 1, 2013 appear to be 5.6 times the current rate. In fact, if the Bush-era tax cuts are allowed to expire, the tax rate after January 1, 2013 is only about 1.13 times the current rate. This is what a more accurate graph would look like:

Disturbingly, when I went to make the above in LibreOffice Calc the default scale was 32-40% (same in Microsoft Excel). While there are times when displaying zero is not necessary, not including zero magnifies the relative differences among the categories (Lemon & Tyagi, 2009). Kozak (2011) gives some recommendations about when including and excluding zero is appropriate; essentially, if zero is meaningful in the context of the data it should be included.
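The two ratios quoted earlier can be reproduced with a few lines of R. This is only a sketch: it assumes a top rate of 35% now and 39.6% after the cuts expire. Those two values are not stated in the text above, which gives only the 34% axis minimum and the resulting ratios.

```r
# Assumed values (not stated in the text): 35% now, 39.6% after expiration.
rate.now   <- 35.0
rate.later <- 39.6
axis.min   <- 34    # where the misleading graph's Y-axis starts

# Ratio of the bar heights as drawn on the truncated axis.
apparent <- (rate.later - axis.min) / (rate.now - axis.min)

# Ratio of the actual tax rates.
actual <- rate.later / rate.now

print(round(apparent, 2))  # 5.6
print(round(actual, 2))    # 1.13
```

Setting `axis.min` to 0 makes the apparent ratio equal the actual one, which is the argument for including zero when zero is meaningful.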

• Kozak, M. (2011). When should zero be included on a scale showing magnitude? Teaching Statistics, 33(2), 53–58. (link to abstract)
• Lemon, J. and Tyagi, A. (2009). The fan plot: A technique for displaying relative quantities and differences. Statistical Computing and Graphics Newsletter, 20(1), 8–10. (link to full article) [Note: I don't know if this article is peer-reviewed, but it is a publication of the ASA, so there is some weight to it.]

I can't believe I forgot to include this one in the statistics comic master post considering I've used it on an exam before.

Hell, my eighth grade science class managed to conclusively reject it just based on a classroom experiment. It's pretty sad to hear about million-dollar research teams who can't even manage that.
"Null Hypothesis" - Copyright CC BY-NC 2.5 by Randall Munroe, xkcd.com
keywords: hypothesis testing; null hypothesis; study; reject;