
Tag: SAS

Fixing truncated strings in SAS

SAS, as powerful as it is, behaves oddly at times. The behavior is usually very well-documented, so it is just something that one has to account for. When SAS creates a string variable without a declared length, the length it chooses is the length of the first value it encounters: any string longer than that first value is truncated (the trailing characters are simply dropped). For example, if 'CAfter' and 'EBefore' are values, and 'CAfter' is encountered first, then the maximum length for all strings will be 6 and 'EBefore' will be shortened to 'EBefor'. If 'EBefore' were encountered first, there wouldn't be a problem. This can sometimes prove tricky if the data are read in unsorted once and then sorted later.
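A minimal sketch of the problem, using a made-up data set (work.trunc_demo and its two rows are assumptions for illustration, not my real data). Group is never given a length, so it takes its length from the first assignment, 'CAfter':

```sas
/* Hypothetical demo data set; names and rows are illustrative only */
data work.trunc_demo;
  input Class $ When $;
  /* No LENGTH statement: Group gets length 6 from 'CAfter' */
  if When = 'After' then Group = 'CAfter';
  else Group = 'EBefore';   /* stored as 'EBefor' */
  datalines;
C After
E Before
;
run;

proc print data=work.trunc_demo;
run;
```

The second row should print with Group equal to 'EBefor', which is the truncation described above.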

To correct this, one should manually specify the length the variable should be when it is created, like so:


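data ling.vowel;
 length Group $ 7;        /* declared before SET so it also applies if Group already exists */
 set ling.vowel;
 Group = Class || When;   /* concatenate Class and When */
 obs = _N_;               /* record the observation number */
run;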

In the above, the length of the variable Group is fixed at 7. (Group is created by concatenating the variables Class and When).
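One way to check that the declared length took effect is to look at the variable attributes with PROC CONTENTS. This is only a quick sanity check, not part of the fix itself:

```sas
/* Lists each variable in ling.vowel with its type and length */
proc contents data=ling.vowel;
run;
```

If the LENGTH statement was applied, Group should be listed as a character variable with length 7.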

Sometimes the truncation is instead introduced when the data are imported. By default, PROC IMPORT only looks at the first 20 observations in a dataset to determine the length of string variables. This can be changed by adding a guessingrows=32767 statement to PROC IMPORT (32767 is the maximum value in older releases; newer releases accept larger values, as well as guessingrows=max):


proc import datafile="&mypath/vot-data-0921.csv" out=ling.vot dbms=csv replace;
 getnames=yes;
 guessingrows=32767;
run;

These are the two solutions I've found for truncated strings, and I was tired of having to find the SAS files that I had used them in.


A gripe with the default SAS GUI

SAS is a great piece of software. Really, it is. As far as getting statistics done (and done well) it is in a league of its own. However, the user interface leaves a lot to be desired.

The little icon above the green arrow? Run all code (or run selected code).
The little icon above the red arrow? Delete all code and close the active file without saving with no confirmation.

Two small, black, similarly-shaped icons. One I use dozens of times in each session. One which I cannot think of a use for. And they put them right. Next. To. Each. Other.

Admittedly, this UI is not the latest and greatest out of North Carolina. SAS products ship with "Enterprise Guide" - a modern, workflow-oriented IDE for SAS. Enterprise Guide has incredible features and is thoroughly modern... but it is (in my experience) slow and overkill for many things. I just want to write some SAS code, see syntax highlighting, and run my code without worrying that I'll accidentally delete what I've been working on. I learned to use SAS through batch jobs on a UNIX system, so any UI is more friendly than the command line... but there has to be a middle ground somewhere.


Identifying the discriminant from LDA

With Linear Discriminant Analysis (LDA), identifying the classification criterion is in general not trivial. The classification regions can be disjoint, there can be several regions into which observations are classified, the discrimination can be based upon a multivariate normal distribution, etc. However, when classifying with two univariate normal populations, the identification of the discriminant function is easy. When the inequality \(\alpha_1+\beta_1 x \gt \alpha_2 + \beta_2 x\) is true, we classify the observation, \(x\), as belonging to population 1. Assuming \(\beta_1 \gt \beta_2\), the discrimination criterion is to classify an observation, \(x\), as belonging to population 1 if

 \[ x \gt \frac{\alpha_2-\alpha_1}{\beta_1-\beta_2}. \]
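For reference, here is the rearrangement written out step by step. The case \(\beta_1 \lt \beta_2\) is added for completeness: dividing by a negative number flips the inequality, so the threshold is the same but the direction reverses.

```latex
\begin{align*}
\alpha_1 + \beta_1 x &> \alpha_2 + \beta_2 x \\
(\beta_1 - \beta_2)\, x &> \alpha_2 - \alpha_1 \\
x &> \frac{\alpha_2 - \alpha_1}{\beta_1 - \beta_2} \quad \text{if } \beta_1 > \beta_2, \\
x &< \frac{\alpha_2 - \alpha_1}{\beta_1 - \beta_2} \quad \text{if } \beta_1 < \beta_2.
\end{align*}
```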

While this is simple, I kept misplacing the piece of paper that I kept it written on and kept re-implementing it in R. This post is mostly a reminder to myself. Here is some R code to do the above:

# Threshold x at which the two linear discriminant functions are equal
d <- function(alpha.2, alpha.1, beta.2, beta.1) {
 (alpha.2 - alpha.1) / (beta.1 - beta.2)
}

This is necessary because SAS doesn't report the discriminant function, even when it could be reported succinctly. SAS does report the \(\alpha_i\) and \(\beta_i\) values, though.

As an example, this is the output from SAS after running PROC DISCRIM on some data with a binary response variable and the POOL=TEST option.

We can use the above to find that the discriminant is \(d=\frac{-3.96320-(-7.86740)}{1.12389-0.79768}=11.96836\). We can use PROC SGPLOT to display this discriminant function:

data work.blog_flat_cross;
 set work.blog_flat_cross;
 ID=_N_;
 Classified = 'A: ' || response || ', P: ' || _INTO_;
run;

/* Displaying the scatterplot without the discriminant line */
PROC SGSCATTER DATA=work.blog_flat_cross;
 PLOT xvar*ID / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
RUN;

/* Displaying the scatterplot with the discriminant line */
PROC SGPLOT DATA=work.blog_flat_cross;
 SCATTER X=ID Y=xvar / MARKERATTRS=(SYMBOL=CIRCLEFILLED) GROUP=CLASSIFIED;
 REFLINE 11.96836394 /axis=y lineattrs=(color=black pattern=1); /* value found by hand using output */
RUN;

A plot of the results from PROC DISCRIM with the discriminant line added.
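Rather than typing the hand-computed value into REFLINE, the threshold could be computed in a data step and passed along as a macro variable. This is a rough sketch rather than what I actually ran: the coefficients are the ones quoted from the PROC DISCRIM output above, and the macro variable name d_cut is made up for this example.

```sas
/* Compute the cutoff from the PROC DISCRIM coefficients quoted above */
data _null_;
  alpha1 = -7.86740;  beta1 = 1.12389;   /* population 1 */
  alpha2 = -3.96320;  beta2 = 0.79768;   /* population 2 */
  d = (alpha2 - alpha1) / (beta1 - beta2);
  call symputx('d_cut', d);   /* d_cut is a hypothetical macro variable name */
  put d=;                     /* writes the value to the log */
run;

/* Same plot as before, with the reference line taken from &d_cut */
proc sgplot data=work.blog_flat_cross;
  scatter x=ID y=xvar / markerattrs=(symbol=circlefilled) group=Classified;
  refline &d_cut / axis=y lineattrs=(color=black pattern=1);
run;
```

The value written to the log should agree with the hand calculation (about 11.968), and the coefficients only need updating in one place if the analysis is rerun.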

Note that adding a REFLINE requires PROC SGPLOT instead of PROC SGSCATTER. Also, this post was edited because the response "YES" was originally just "YE". The text response was shorter than it should have been because SAS sets the length of a variable to the length of the first value it encounters. Any values longer than the first value are truncated (see the SAS documentation). To fix this, when declaring the variable in a data step one should manually set the length: length response $ 3; where 3 is the maximum length we want.

A good resource for LDA is:

Johnson, R. A. and D. W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th ed.). Upper Saddle River, NJ, USA: Pearson Prentice Hall.
