Unexpected results and defensive coding

19 June 2018

Defensive coding is an important part of Good Programming Practice (GPP) and has been defined as "an approach to programming intended to anticipate future changes of the data that might influence the coding algorithms. Ideally programs should be written in such a way that they will continue to work correctly in case of new or unexpected data values which did not exist at the time the code was developed" [ref]. This is especially important if we develop the code on a subset of data, so the analyses can be generated as soon as the final data become available.

For example, when selecting a lab parameter we may write the code as "if upcase(param)='RBC'" (RBC=red blood cells). We expect the text variable param to be in uppercase, but because of the upcase function this code works whether the text is in lower case, mixed case or upper case. Or, if we want to select patients who experience adverse events we may write "if upcase(AE)=:'Y'". The "=:" operator compares only as many characters as the shorter operand contains, so together with upcase it catches 'Y', 'y', 'yes', 'Yes' and 'YES'. Another very common mistake made by programmers new to SAS is to write e.g. "if age lt 20 then agecat='<20'". This ought to be "if age ne . and age lt 20 then agecat='<20'" because SAS orders a missing numeric value below every number, so a missing age would otherwise be classed as '<20'. We should always allow for the possibility of missing data, even if it is not expected.
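The three checks above can be tried together in a minimal sketch. The data set names (demo, checked), the flag variables and the test values are made up for illustration, and missing() is used in place of "ne ." only because it also covers special missing values such as .A.

```sas
/* Hypothetical test data: mixed-case text, a blank AE and a missing age */
data demo;
  infile datalines dsd truncover;
  input param :$8. ae :$3. age;
  datalines;
rbc,Y,25
Rbc,yes,.
RBC,No,18
WBC,,40
;
run;

data checked;
  set demo;
  length agecat $4;
  /* upcase makes the comparison independent of the case in the data */
  is_rbc = (upcase(param) = 'RBC');
  /* =: compares only the first character here, so Y and YES both match */
  had_ae = (upcase(ae) =: 'Y');
  /* a missing age is kept out of both categories */
  if not missing(age) and age lt 20 then agecat = '<20';
  else if age ge 20 then agecat = '>=20';
run;
```

In this sketch the second record has a missing age, so agecat stays blank instead of being set to '<20'.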

Defensive coding requires us to anticipate unexpected results (an oxymoron perhaps). Consider the following code and resulting output:

The code has not worked and the result is unexpected. We should instead merge the data sets first and then in a second data step add the asterisk to sex. Consider another example:

Again, the result is not what we desire. If the variable being amended (in this case var2) exists on the first data set, then we do not have this erroneous result.
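One situation in which amending a variable inside the merging data step goes wrong, assumed here purely for illustration, is a one-to-many merge. The data sets demog and visits and their contents are hypothetical. The sketch contrasts amending sex in the merge step with the two-step approach recommended above for the sex example.

```sas
/* Hypothetical data: one demography row per subject, several visits */
data demog;
  length subjid 8 sex $4;   /* sex is given room for the added asterisk */
  input subjid sex $;
  datalines;
1 M
2 F
;
run;

data visits;
  input subjid visit;
  datalines;
1 1
1 2
1 3
2 1
;
run;

/* Risky: sex is read from demog only once per BY group and then retained, */
/* so subject 1 ends up with M*, M** and M*** on successive visits.        */
data risky;
  merge demog visits;
  by subjid;
  sex = cats(sex, '*');
run;

/* Safer: merge first, then amend sex in a second data step, */
/* so every row gets exactly one asterisk.                   */
data merged;
  merge demog visits;
  by subjid;
run;

data safer;
  set merged;
  sex = cats(sex, '*');
run;
```

The second data step reads each merged row afresh, so nothing is carried over from the previous row of the same BY group.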

Finally, it is important to search the log for the right terms when confirming the program has run without a hitch. I guess most programmers maintain a list of such terms. Personally I search for: error, warning, repeat ("merge statement has more than one data set with repeats of by values"), multiple ("multiple lengths were specified for the variable"), gener ("missing values were generated as a result of performing an operation on missing values"), initial (variable is "uninitialized"; note the American spelling).
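The search itself can be automated. The following is a minimal sketch that assumes the log has been saved to a file; the path myprogram.log, the fileref mylog and the data set log_issues are placeholders, not part of any standard set-up.

```sas
/* Hypothetical path: replace with the location of your own log file */
filename mylog 'myprogram.log';

data log_issues;
  infile mylog truncover lrecl=32767;
  length term $12 line $256;
  input line $char256.;
  lineno = _n_;   /* line number within the log */
  /* case-insensitive search for each term on the list above */
  do term = 'error', 'warning', 'repeat', 'multiple', 'gener', 'initial';
    if find(line, strip(term), 'i') then output;
  end;
run;

proc print data=log_issues noobs;
  var lineno term line;
run;
```

A log line that matches more than one term is listed once per term. An empty listing only shows that none of these terms occurred, so it is a quick check and not a substitute for reading the log.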