# Tag: statistics

Familiarity with statistical computing software - particularly programs as flexible and feature-filled as R and the packages on CRAN - has been a tremendous boon. However, this familiarity has sent me searching the web for a way to ask for particular output that is not printed by default. This expectation that the output I want from software is available with the right option or command has led me (more than once) to forget the possibility of simply computing the required output manually.

In particular, I recently needed to compute the RMSEA of the null model for confirmatory factor analysis (CFA). A few months ago, I chose to use Mplus for the CFA because I was familiar with it (more so than the lavaan R package at least) and it had some estimation methods I needed that other software does not always have implemented (e.g. the WLSMV estimator is not available in JMP 10 with SAS PROC CALIS).

Mplus does not print the RMSEA for the null model (or baseline model, in Mplus parlance) in the standard output, nor does there seem to be a command to request it. Fortunately, this is not an insurmountable problem because the formula for RMSEA is straightforward:

$$\mathrm{RMSEA} = \sqrt{\frac{\max(X^2 - \text{df},\ 0)}{\text{df}\,(N - 1)}}$$

where $X^2$ is the observed Chi-Square test statistic, df is the associated degrees of freedom, and N is the sample size. In the Mplus output, look for "Chi-Square Test of Model Fit for the Baseline Model" for the $X^2$ and df values.
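The hand calculation is short enough to wrap in a small R function. This is only a sketch: the function name `rmsea` and the example numbers are made up for illustration (they are not from my analysis), and it assumes the form of the formula given by Kenny (2014), with N - 1 in the denominator and the value set to zero when the Chi-Square is smaller than its degrees of freedom. Some software uses N rather than N - 1, so small differences from printed output are possible.

```r
# Sketch of a hand-computed RMSEA (hypothetical helper, not part of Mplus or lavaan).
# Assumes RMSEA = sqrt(max(X^2 - df, 0) / (df * (N - 1))), following Kenny (2014).
rmsea <- function(chisq, df, n) {
  stopifnot(df > 0, n > 1)
  sqrt(max(chisq - df, 0) / (df * (n - 1)))
}

# Made-up baseline-model values, as read off the
# "Chi-Square Test of Model Fit for the Baseline Model" section of the output
null.rmsea <- rmsea(chisq = 500, df = 45, n = 300)

# Compare against the 0.158 threshold discussed by Kenny (2014)
null.rmsea < 0.158
```

With real output, the three arguments are simply the baseline Chi-Square value, its degrees of freedom, and the sample size.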

The reason for needing to check the null RMSEA is that incremental fit indices such as CFI and TLI may not be informative if the null RMSEA is less than 0.158 (Kenny, 2014). If you are using the lavaan package, it appears this can be calculated using the nullRMSEA function in the semTools package.

(As an aside, don't let the vintage '90s design fool you: David Kenny's website is a great resource for structural equation modeling. He has the credentials to back up the site, too: Kenny is a Distinguished Professor Emeritus at the University of Connecticut.)

References

Kenny, D. A. (2014). Measuring Model Fit. Retrieved from http://davidakenny.net/cm/fit.htm

I spotted a dot plot while watching TV the other day:

It isn't often that one sees a dot plot on TV, so this is a good opportunity to discuss something students might have encountered. This commercial might be a worthwhile topic of discussion in a statistics lesson.

They apparently constructed the dot plot by asking 400 people "How old is the oldest person you've known?" A few more details can be gleaned from the Prudential website and a "behind the scenes" video that was shot.

A few things that can be discussed with students come to mind:

• What can we actually conclude from the dot plot?
• The YouTube video's description calls this an "experiment" (as does the narrator in the behind the scenes video). Is this really an experiment?
• What do we know about the sample?
• What happens as we get older in terms of the oldest person "[we]'ve known"? (Children and adults with a wide range of ages are asked to place a sticker.)

There's a 30 second version of the advertisement, too.

I saw this graphic reblogged by NPR on Tumblr (originally posted by Luminous Enchiladas, though I can't be sure of the creator), and I must say that it is impressive.

Olympics vs Mars

There are some pretty substantial problems with this impressively bad graphic.

• Pie charts should only be used when comparing parts to a whole. The $17.5 billion that went to the Olympics and the Curiosity Rover wasn't a priori some whole amount of money. Treating it as "the whole" implies that there was only $17.5 billion from wherever to be spent, and that it was spent only on the Olympics and Mars.
• The pieces of the pie chart aren't labeled with the dollar amounts. Instead, the pieces are labeled with the piece's name, which does address a complaint with pie charts (namely that the reader needs to continually look back and forth from the chart to the key). Because there are only two pieces, there is room for including the dollar figures in the chart area. With more complicated charts, this wouldn't be the case.
• This chart uses an unnecessary "3D" effect which obscures the true areas being compared. A flat pie chart would be less misleading.

Additionally, there are some general problems with pie charts which make them inferior to other charts (specifically bar charts):

• Comparing areas is difficult. Cleveland (1985) writes about how area comparisons are subject to bias, and Schmid (1983) specifically describes how, when comparing two circles (e.g. two pie charts of different size used to indicate change over time), the area of the larger circle is underestimated relative to the smaller.
• Comparing angles is difficult. Cleveland (1985) states that ordering the sections of a pie chart is prone to error based on earlier empirical research.

References:

• Cleveland, W. S. (1985). The elements of graphing data. Monterey, Calif: Wadsworth Advanced Books and Software.
• Schmid, C. F. (1983). Statistical graphics: Design principles and practices. New York: Wiley.

The other day I was looking for a package that did the Quadrant Count Ratio (QCR) in R. I couldn't find one, so I whipped up some simple code to do what I needed to do.

qcr <- function(dat){
  n <- nrow(dat)
  m.x <- mean(dat[,1]); m.y <- mean(dat[,2])
  # in QCR, points on the mean lines fall in no quadrant,
  # so they add to neither count (they still count toward n)
  # number of points in Quadrants 1 and 3
  q13 <- sum(dat[,1] > m.x & dat[,2] > m.y) + sum(dat[,1] < m.x & dat[,2] < m.y)
  # number of points in Quadrants 2 and 4
  q24 <- sum(dat[,1] < m.x & dat[,2] > m.y) + sum(dat[,1] > m.x & dat[,2] < m.y)
  return((q13-q24)/n)
}

The above assumes dat is an Nx2 array with column 1 serving as X and column 2 serving as Y. This can easily be changed. I also wrote a little function to plot the mean lines:

plot.qcr <- function(dat){
  value <- qcr(dat)
  plot(dat, main=paste("QCR =", value)) # scatterplot, with the QCR in the title
  abline(v=mean(dat[,1]),col="blue") # adds a line for x mean
  abline(h=mean(dat[,2]),col="red") # adds a line for y mean
}

Both of these functions are simple, but I will likely extend and polish them (and then release them as a package). I'd also like to explore what would happen to the QCR if median lines were used instead of mean lines. (This new QCR* would no longer directly motivate Pearson's Product-Moment Correlation, but could have its own set of advantages.) Below is a quick example:

# QCR example
set.seed(1)
dat.x <- c(1:10)
dat.y <- rbinom(10,10,.5)
dat <- cbind(dat.x,dat.y)
qcr(dat)
# [1] 0.6
plot.qcr(dat)

This is the plot:

For more information on the QCR check out this article: Holmes, Peter (2001). “Correlation: From Picture to Formula,” Teaching Statistics, 23(3):67–70.
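The median-line variant mentioned earlier could be sketched by letting the centering function be an argument. This is only an illustration of the idea, not a polished implementation: the name `qcr.center` and its `center` argument are hypothetical, and the data are hard-coded so the block runs on its own.

```r
# Sketch: QCR with a choice of center lines (hypothetical helper).
# center = mean gives the usual QCR; center = median gives the QCR* idea.
qcr.center <- function(dat, center = mean) {
  # signed position of each point relative to the vertical and horizontal center lines
  sx <- sign(dat[, 1] - center(dat[, 1]))
  sy <- sign(dat[, 2] - center(dat[, 2]))
  q13 <- sum(sx * sy > 0) # same signs: Quadrants 1 and 3
  q24 <- sum(sx * sy < 0) # opposite signs: Quadrants 2 and 4
  # points on a center line have a product of 0 and add to neither count
  (q13 - q24) / nrow(dat)
}

# Small hard-coded data set for illustration
dat <- cbind(x = 1:10, y = c(4, 4, 5, 7, 4, 7, 7, 6, 6, 3))
qcr.center(dat)                  # mean lines
qcr.center(dat, center = median) # median lines
```

Because medians are less sensitive to outliers than means, the two versions can disagree when a few extreme points pull the mean lines away from the bulk of the data.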

[Updated on 2013-10-03 to add a link to Wikipedia.]

Ever on the quest for good comics related to statistics, I've started searching for specific terms. Of course, xkcd doesn't fail to deliver with a comic related to Markov chains.

Freestyle rapping is basically applied Markov chains.
"90's Flowchart" - Copyright CC BY-NC 2.5 by Randall Munroe, xkcd.com
keywords: flowchart; Markov chain; probability;