• Paul Brown, Director

Back-coding in R instead of SPSS.

If you are an SPSS user, you will be used to variables having values and value labels for both ordinal and categorical variables (i.e., for all types of factor variables), which gives the categories an implicit ordering. Whilst an ordering does not really make sense for categorical variables, it does have some advantages in the set-up and analysis of survey data, because new labels can be added to a variable without changing the existing implicit ordering. This is particularly useful when a survey is being ‘back-coded’. Take a survey where analysis is to be conducted by brand. Here, most brands are ‘pre-coded’ with a sequential brand id, and the ‘Other’ and ‘Don’t know’ categories are given much larger brand ids than the rest. This allows additional brands to be added to the survey at a later date, either before or after the survey closes, whilst still separating actual brands from non-brands (so that the latter can be kept at the foot of tables).

For example, suppose there were three brands in the original survey, with the additional possibilities of ‘Other’ or ‘Don’t know’; if ‘Other’ was chosen, the respondent could write in their choice of brand. After the survey was closed, these verbatims were examined and it was determined that two more brands should be included; that is, respondents who wrote in Brand D or Brand E were given their own (new) brand id, as in the table below.

Next, a .csv file was produced with a respondent ID and a back-code for those who replied ‘Other’. This file then had to be merged into the original data file to create a new brand variable for the analysis.

In all packages, it is relatively simple to do the merging and matching by respondent ID. Moreover, in SPSS (and others), it is easy to create the new brand variable from the old one, with the additional brand ids simply inserted as new value labels into the original list.

This is not necessarily the case in other packages. If the original brand variable is treated as a genuine categorical variable (an unordered factor), there are no brand ids, only labels, and these are often text. So you cannot simply add new values and associated labels to such a variable.

We ran into this problem whilst using R, which uses the factor type for categorical variables and the ordered type for ordinal variables.
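A minimal sketch of this limitation in R, using invented brand labels (they are not taken from the survey): assigning a label that is not already one of a factor's levels does not add it. R stores NA and issues a warning instead.

```r
# Hypothetical brand variable stored as an unordered factor (labels only, no ids)
brand <- factor(c("Brand A", "Brand C", "Other", "Brand B"),
                levels = c("Brand A", "Brand B", "Brand C", "Other", "Don't know"))

# Try to back-code the 'Other' response to a brand that is not yet a level.
# R warns that the factor level is invalid and stores NA instead.
recoded <- brand
recoded[3] <- "Brand D"

print(recoded)
print(levels(recoded))  # "Brand D" has not been added to the levels
```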

In the example above, we first created the factor variable from the original brandid variable by assigning the correct brand labels as its levels (we insisted on the correct ordering, so that ‘Other’ and ‘Don’t know’ were at the bottom of the label stack). We then merged the data files correctly. However, we were then unable to recode this factor variable from the back-code file. Firstly, the values in the factor variable were all text, whilst they were integers in the back-code file. Secondly, we didn’t have the correct number of levels in the original variable, and there seemed to be no easy way to add new levels within the existing sequence.
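Both obstacles can be illustrated with a small sketch. The ids and labels here are invented for illustration and are not the ones from the survey. Once the ids have been replaced by labels, the integers underlying the factor are level positions rather than the original brand ids, so they cannot be matched against integer back-codes.

```r
# Hypothetical original brand ids: 1-3 are brands, 98 = Other, 99 = Don't know
brandid <- c(1, 3, 98, 2, 99)

# Factor created too early: the ids are replaced by text labels
brand <- factor(brandid,
                levels = c(1, 2, 3, 98, 99),
                labels = c("Brand A", "Brand B", "Brand C", "Other", "Don't know"))

# The values are now text ...
print(as.character(brand))

# ... and the underlying integers are level positions (1 to 5), not brand ids,
# so 'Other' is stored as 4 rather than 98
print(as.integer(brand))

# Only the five original labels exist as levels, so an integer back-code for a
# new brand has nothing to map to
print(nlevels(brand))
```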

Our mistake was that we created the factor variable too early. We should have created it after all the recoding was completed.

So, instead, we first created a new brandid from the old brandid and the backcode variable, and then converted this variable into the required factor variable. This solved all our problems. The code is below. (Note that we have created the new brandid inside the new factor variable.) We have included the import and merge syntax, a few tricks (e.g., how to make the brand list vector from the brandlist csv file) and the checking cross table. It can easily be amended as required.

In our case, the files were brand_data.csv, back_coding.csv, and brandlist.csv.

#Get original data
ScoreData=read.csv('brand_data.csv', header=TRUE)

#Get back codes file
BackCodeData=read.csv('back_coding.csv', header=TRUE)

#Get new brand list (so this includes any new brands after back-coding)
Brands=read.csv('brandlist.csv', header=TRUE)

#Merge original data file with back-code file on column='ID'
#Note outer join and only importing backcode column
MergedData=merge(ScoreData,BackCodeData[c('ID','backcode')], by='ID', all=TRUE)

#Make new factor variable which checks whether there is a back-code and if so,
#use this as new code, else uses original brand code

#Turn brand field from brand list data frame into a vector
#Note sorting by field brandid to avoid problem of alphabetical sort on
#(possible) factor column brand
#Start from an empty vector so the loop has something to append to
BrandList=c()
for (i in Brands[order(Brands$brandid),'brand']){BrandList=c(BrandList,i)}

#Set the levels of the NewCode to be the brand list vector

#Do the same for the old brand list by first dropping the new codes
#then changing old brand into factor variable

#Reset the vector before rebuilding it for the old brand list
BrandList=c()
for (i in Brands[order(Brands$brandid),'brand']){BrandList=c(BrandList,i)}

#Set the levels of the original code to be the brand list vector

#Drop back-code column as no longer required

#Check counts by cross old by new brands
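The steps described in the comments above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the original code. The data frames are built in memory with invented values rather than read from the csv files. The column names ID, brandid, brand and backcode and the variable NewCode follow the listing, whereas NewId, OldCode, OldBrands, CheckTable and the is_new flag are hypothetical names introduced here (the source does not say how the new codes were dropped). It also maps ids to labels with the levels and labels arguments of factor() in place of the loop, which is a swapped-in technique.

```r
# Invented stand-ins for brand_data.csv, back_coding.csv and brandlist.csv
ScoreData <- data.frame(ID = 1:7, brandid = c(1, 2, 98, 3, 98, 99, 98))
BackCodeData <- data.frame(ID = c(3L, 5L), backcode = c(4, 5))
Brands <- data.frame(
  brandid = c(98, 1, 4, 2, 99, 3, 5),
  brand = c("Other", "Brand A", "Brand D", "Brand B",
            "Don't know", "Brand C", "Brand E"),
  is_new = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)  # hypothetical flag
)

# Outer join on ID; respondents without a back-code get NA in 'backcode'
MergedData <- merge(ScoreData, BackCodeData[c("ID", "backcode")],
                    by = "ID", all = TRUE)

# Recode first: use the back-code where there is one, else the original brand id
NewId <- ifelse(is.na(MergedData$backcode), MergedData$brandid, MergedData$backcode)

# Order the brand list by id so 'Other' and 'Don't know' stay at the bottom
Brands <- Brands[order(Brands$brandid), ]
OldBrands <- Brands[!Brands$is_new, ]

# Only now create the factors, mapping ids to labels explicitly
MergedData$NewCode <- factor(NewId, levels = Brands$brandid,
                             labels = Brands$brand)
MergedData$OldCode <- factor(MergedData$brandid, levels = OldBrands$brandid,
                             labels = OldBrands$brand)

# Drop the back-code column as it is no longer required
MergedData$backcode <- NULL

# Check counts by crossing old brands by new brands
CheckTable <- table(Old = MergedData$OldCode, New = MergedData$NewCode)
print(CheckTable)
```

In the cross table, the ‘Other’ row of the old variable should be the only row whose counts are spread across more than one column of the new variable. Passing both levels and labels to factor() ties each label to its id, so the result does not depend on every brand id actually occurring in the data.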



If you really wish to mimic SPSS in R, with variable and value labels for categorical variables, you can use the labelled package. The haven package can be used to import SPSS data files into R. Labelled variables will allow you to use the same tricks for back-coding as outlined above (although you will still need to create the various labels so they can be used in the for loop in the code above). Note that you must still turn labelled variables into factors to conduct categorical analysis in R.
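As a rough illustration, the sketch below attaches SPSS-style value labels to numeric brand ids and then converts the result to a factor. It assumes the labelled package is installed, and the ids and labels are invented for illustration. It uses the package's labelled(), val_labels() and to_factor() functions; sort_levels = "values" asks for the levels to be ordered by the underlying ids.

```r
library(labelled)  # assumes the labelled package is installed

# Hypothetical brand ids after back-coding (4 and 5 are the new brands)
codes <- c(1, 2, 4, 3, 5, 99, 98)

# Attach SPSS-style value labels to the numeric codes
brand <- labelled(codes, labels = c(
  "Brand A" = 1, "Brand B" = 2, "Brand C" = 3,
  "Brand D" = 4, "Brand E" = 5,
  "Other" = 98, "Don't know" = 99
))

# Inspect the value labels; the data themselves are still numeric ids
print(val_labels(brand))

# Convert to a factor for categorical analysis, with levels ordered by id
brand_f <- to_factor(brand, sort_levels = "values")
print(levels(brand_f))
```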

Author: Paul Brown, Director
