# Category: Statistics

Familiarity with statistical computing software - particularly programs as flexible and feature-filled as R and the packages on CRAN - has been a tremendous boon. However, this familiarity has sent me searching the web for a way ask for particular output that is not printed by default. This expectation that the output I want from software is available with the right option or command has led me (more than once) to forget the possibility of simply computing the required output manually.

In particular, I recently needed to compute the RMSEA of the null model for confirmatory factor analysis (CFA). A few months ago, I chose to use Mplus for the CFA because I was familiar with it (moreso than the lavaan R package at least) and it had some estimation methods I needed that other software does not always have implemented (e.g. the WLSMV estimator is not available in JMP 10 with SAS PROC CALIS).

Mplus does not print the RMSEA for the null model (or baseline model, in Mplus parlance) in the standard output, nor does there seem to be an command to request it. Fortunately, this is not an insurmountable problem because the formula for RMSEA is straightforward:

where $X^2$ is the observed Chi-Square test statistic, df is the associated degrees of freedom, and N is the sample size. In the Mplus output, look for "Chi-Square Test of Model Fit for the Baseline Model" for the $X^2$ and df values.

The reason for needing to check the null RMSEA is that incremental fit indices such as CFI and TLI may not be informative if the null RMSEA is less than 0.158 (Kenny, 2014). If you are using the lavaan package, it appears this can be calculated using the nullRMSEA function in the semTools package.

(As an aside, don't let the vintage '90s design fool you: David Kenny's website is a great resource for structural equation modeling. He has the credentials to back up the site, too: Kenny is a Distinguished Professor Emeritus at the University of Connecticut.)

References

While I was working on my master's degree I took an introductory course in Monte Carlo methods (STA 6866 at UF). As is typical of grad school, a final project was required - in this case something related to a Monte Carlo method not covered in the course (we used Introduction to Monte Carlo Methods in R by Robert and Casella). I chose to do a brief report on and implementation of Monte Carlo Tree Search (MCTS).

The purpose of this post is not to introduce MCTS, but rather to post what I did for the project in case others find it useful. MCTS is a method that can be used to create an artificial intelligence (AI) for games (both video games and traditional games). As a proof-of-concept, I implemented MCTS as an AI for a game of Connect Four. (It was really 'Connect Three' on a 4x4 board due to computational constraints, but the concept remains the same and the code should generalize easily.) Connect Four is not an ideal use for the MCTS AI because it is a solved game (unlike Go or Chess). I chose to use Connect Four because it is easy to code in R, easy to play, and the purpose is to demonstrate the MCTS method for AI and not to create a truly competitive AI.

The relevant files are:

• An R file that explains how to use the game and AI (OPEN-ME.R)
• R code for the Connect Three/Four/K game (can choose Player 1 vs. Player 2, Player 1 vs. AI, or AI vs. AI) (connect-game-code.R)
• R code for the MCTS AI (mcts-ai-avec-block.R contains deterministic code to make the AI play defensively if Player 1 can win in one move to make play more realistic; mcts-ai-sans-block.R does not contain the blocking code)
• R code for a 'naive' AI (selects from the available moves without doing any weighting, coded for comparison purposes) (naive-ai.R)
• A helper file previously detailed (pprint.R)
• A file containing 25000 simulated games (giving this to the MCTS AI improves its performance) (BLR-4x4-AI1-24k.txt)
• The report I wrote about MCTS for the class (pdf)

The report I wrote for class can be download here, and all of the other files are in one zip archive. If you found this useful, I'd love to hear about it. If you want me to do something else with this, also let me know.

Update - December 8, 2014

It looks like the statistics department finally took down my old homepage. I've uploaded the files to this website instead. It may be worth repeating that this work was work done for a class, and, while I enjoyed doing it and learned a lot, I make no guarantees about the code or report.

In early October 2012, the media started running stories about how eating many servings of fruits and vegetables is linked with happiness. The study in question was an observational study of the eating habits of 80,000 Britons and did find that, controlling for other socio-economic variables, high levels of happiness were associated with eating 7-8 servings (2.8oz) of fruits and vegetables per day (Blanchflower, Oswald, & Stewart-Brown, 2012). The study made it clear (in both the abstract and text) that because of its observational nature causality could not be determined:

• "Reverse causality and problems of confounding remain possible."
• "This implies that, as in some other parts of the well-being literature, we cannot draw firm inferences about causality."
• "... with caveats about the lack here of clinching causal evidence..."
• "... it is sensible to emphasize, first, the need for extreme caution in the interpretation of this study’s findings..."

The authors repeatedly made appropriate statements about the interpretability of the study  and the potential for future controlled studies to determine causality — exactly what one should do. The importance of this work is not diminished because of its observational nature, and serves to fill a gap in the well-being literature and suggest areas for future research.

But then the media happened.

Of course, some sources got the story right:

References

• Blanchflower, David G., Oswald, Andrew J., and Stewart-Brown, Sarah (2012). Is Psychological Well-Being Linked to the Consumption of Fruit and Vegetables? Social Indicators Research(in press).

There are presently no shortage of articles being written about Hurricane (now Post-Tropical Cyclone) Sandy and what role climate change has played in the formation of the "Frankenstorm" that has left over 7.5 million people without power. I came across one article, however, that should be addressed: Hurricane Sandy: The Worst-Case Scenario For New York City Is Unimaginable (by Mike Tidwell, at ThinkProgress.org).

Much of the article is a traditional "worst-case scenario" description found commonly in the media for seemingly all combinations of disasters and major cities. Of note, however, is the following paragraph (emphasis added):

Another major storm struck in 1892, then another in 1938 when the borderline Category 4 “Long Island Express” passed through the outskirts of greater New York, inflicting widespread death and destruction across New York state, New Jersey and much of New England. But that storm, 68 years ago, was the last major hurricane (Category 3 or above) to strike the New York Metropolitan region. It’s now a matter of when, not if, a big hurricane will strike again, according to meteorologists. And history says “when” is very soon.

The bolded statement is a classic example of the gambler's fallacy. It seems as if hurricanes can be reasonably modeled by a Poisson process, a useful probability model (e.g. this website, this paper (pdf), and this paper (pdf)), and this assumption will be used throughout this post.

One of the important properties associated with this model is called memorylessness. Essentially, what this means is that at any given time period, the probability of an event occurring does not depend on the history up to that point. That is, it doesn't matter how long a system has been running, a 'success' is just as likely to occur at this time period if this is the first time period, if this is the 10th time period, or if this is the 76th time period.

For example, when playing roulette, the probability of spinning a 'black' are the same on every spin, irrespective of how many times you have previously spun the wheel. Even if there have been 999 'red' in a row, we are not 'due' for a black spin (assuming the wheel is fair, etc.).

Just because New York City hasn't had a major hurricane for a while does not mean that it is due or overdue for one. Each hurricane season is a new cycle and does not remember the previous years' hurricanes.

I am not an expert in the fields of research associated with most disasters, but this makes sense in terms of hurricanes (both from various sources and my life as a Floridian). For other disasters (e.g. volcanoes and earthquakes), my understanding is that pressure builds over time and therefore reasonable models for those systems should not have the memorylessness property, so perhaps talking about 'overdue' for a major disaster is warranted. Maybe not, but I do want to emphasize that this is specific to hurricanes.

(It might also be worth checking out these comics related to the gambler's fallacy.)

The other day I was looking for a package that did the Quadrant Count Ratio (QCR) in R. I couldn't find one, so I whipped up some simple code to do what I needed to do.

qcr <- function(dat){
n <- nrow(dat);
m.x <- mean(dat[,1]); m.y <- mean(dat[,2]);
# in QCR we ignore points that are on the mean lines
# number of points in Quadrants 1 and 3
q13 <- sum(dat[,1] > mean(dat[,1]) & dat[,2] > mean(dat[,2]))+sum(dat[,1] < mean(dat[,1]) & dat[,2] < mean(dat[,2]))
# number of points in Quadrants 2 and 4
q24 <- sum(dat[,1] < mean(dat[,1]) & dat[,2] > mean(dat[,2]))+sum(dat[,1] < mean(dat[,1]) & dat[,2] > mean(dat[,2]))
return((q13-q24)/n)
}


The above assumes dat is an Nx2 array with column 1 serving as X and column 2 serving as Y. This can easily be changed. I also wrote a little function to plot the mean lines:

plot.qcr <- function(dat){
value <- qcr(dat);
plot(dat);
abline(v=mean(dat[,1]),col="blue"); # adds a line for x mean
abline(h=mean(dat[,2]),col="red"); # adds a line for y mean
}


Both of these functions are simple, but I will likely extend and polish them (and then release them as a package). I'd also like to explore what would happen to the QCR if median lines were used instead of mean lines. (This new QCR* would no longer directly motivate Pearson's Product-Moment Correlation, but could have its own set of advantages.) Below is a quick example:

# QCR example
set.seed(1)
dat.x <- c(1:10)
dat.y <- rbinom(10,10,.5)
dat <- cbind(dat.x,dat.y)
qcr(dat)
# [1] 0.6
plot.qcr(dat)


This is the plot: