What is an ordinal variable?

I know what you have all been thinking—“Alex, when are you going to talk about statistics?”

You’re right, I’ve been totally remiss. Let me try to make it up to you by tackling a topic I’ve been thinking about a bit lately that throws many would-be analysis champions for a loop: ordinal variables.

Most people, I would think, are familiar with numeric data. Numeric data include data that are continuous (like heights) and data that are less than fully continuous. The latter camp includes data types like integers (i.e., whole numbers, such as counts of dragonflies observed per day), ratios, and proportions. By continuous, I mean the data could theoretically take any value between negative infinity and infinity and go down to any decimal place imaginable, if one had the instrument to measure to that level of precision. Less continuous data violate one or more elements of this definition; for example, proportions may be practically continuous between 0 and 1, but they are bounded by these two values and can’t go outside of them. Thus, not fully continuous.

Most people are also, I think, probably familiar with categorical data. These are classifications that have no innate hierarchy to them. Examples of these include biological sex (male vs. female), plot type (control vs. treatment), color (blue vs. red vs. yellow), and so on. With these data, there’s no reason to suspect that one order for these categories makes any more or less sense than any other; the categories are, for all practical purposes, equal.

These two types of data are very common–so much so that these are probably the main types of data most of us regularly handle. But there is another class of data out there that is unique and surprisingly common: ordinal data. Ordinal data fall sort of in between numerical and categorical data and thus have some elements of each. Like categorical data, ordinal data are often categories or classifications of some kind–things not easily represented by numbers that wouldn’t just be arbitrary. However, like numerical data, ordinal data have an innate (or reasonable) order to them (hence the name); one way of ordering the categories somehow makes more sense than other potential orders. A classic example of ordinal data would be classifications of a variable being “low,” “medium,” and “high.” These are clearly categories, and they are not in any way obviously mapped to specific numbers (e.g., 100 might be “low” for some things but “high” for others). Still, the categories are clearly not “equal” in the way that true categorical categories would be. Much like numbers, it instead makes perfect sense to order them in “ascending order,” as I did above. Classic Likert scales (e.g., strongly disagree, disagree, neutral, agree, strongly agree) like those you see on surveys are also ordinal. Even our months are ordinal. January is not innately “6 less than” July, nor would the “average” of January and July innately be April, so these data behave like categorical data in these ways. However, putting July “before” January doesn’t make sense because that’s not how our system of time works, so this is a more numerical property of these data.
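
If you work in R, by the way, this distinction is baked right into how you can store the data: a regular factor for categorical data versus an ordered factor for ordinal data. Here’s a minimal sketch (the answers and levels are made up, just for illustration):

answers = factor(c("agree", "neutral", "strongly agree"), levels=c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"), ordered=T) #An ordered factor: R now knows these levels have a meaningful order
answers[1] > answers[2] #TRUE; a comparison like this only makes sense because the levels are ordered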

There’s one feature of ordinal data that makes them especially interesting but also kind of a pain to meaningfully analyze: Like numerical data, the categories in ordinal variables can be logically ordered, but unlike with numerical data, the distances between those categories are unknown, inconsistent, or squishy. This part is hard to wrap one’s head around. Let’s consider a simple example here: Think about the numbers 2, 4, and 6. 6 is higher than 4, and 4 is higher than 2, so these numbers have an innate order to them. Moreover, though, 2 is always and exactly 2 away from 4, and 4 is always and exactly 2 away from both 2 and 6. Anyone who understands our system of numbers and math will always agree with those statements. In other words, both the order of the numbers and their distances from each other mean something both concrete and consistent. By contrast, think about the ordinal categories disagree, neutral, and agree. It’s pretty easy to imagine that “neutral” is “higher” in some logical sense than “disagree” and that “agree” is “higher still” than “neutral.” No other order would make as much sense for these data. However, can we say, with no other information, that “disagree” is exactly and always 3 less than “neutral?” Or that it’s 6 less than “agree?” For every question I could possibly ask you, would a “neutral” always and exactly mean a “5” to you? If I asked you and five of your friends the same question, and you all answered “neutral,” could I assume that neutral was always and exactly a 5 for all of you? Probably not, right? “Neutral” has meaning, and it also has a relative meaning when compared with “disagree” and “agree,” but these meanings are not exact–they are wobbly and shift from context to context, moment to moment, and person to person.

Therein lies the unique nature of ordinal data. Like numerical data, there is reliable information contained in the order of ordinal data categories (“agree” is innately higher on the scale of meaning than “neutral”) but, like with categorical data, there is not reliable information contained in the “distances” between ordinal data categories (“agree” may not be always exactly the same “distance” away from “neutral” on a given scale for a given context for every person).

Why does the distinction between ordinal data and other types of data matter, though? There are two important and opposite reasons. The first is that many researchers treat ordinal data as categorical data in their analyses. I’m not about to say that this course of action is improper or irrational. However, I do have two issues with that. First, treating ordinal data as categorical data can be confusing or frustrating to anyone who recognizes your data have an order to them you seem to be ignoring. Imagine you were looking at a bar graph and the categories on the x-axis were labeled “high light,” “low light,” and “medium light” going from left to right. Now imagine there’s a positive relationship between light levels and your y-axis variable. The bars are going to be tall, short, and medium in length. At first glance, you might reasonably conclude from the graph that there’s no trend–the bars are all over the place! But there IS a trend–you’ve simply obscured it by ignoring the ordinal nature of your data. Not a good look.

The second (and maybe more important) issue I have with the practice of treating ordinal data as categorical data is that doing so is potentially wasting useful information contained in the data. With categorical data, we can extract no information from the order of categories nor from their apparent distances from one another—all categories are distinct and equal, and only the category itself tells us anything useful. Numerical data, meanwhile, are always ordered exactly, such that their relationships to each other have consistent meaning. It should then be apparent, I hope, that numeric data contain more information than categorical data. By the same token, then, ordinal data contain less information than numeric data (distances between categories are squishy) but more than categorical data (their order is always meaningful). So, it’s not that it’s wrong to treat ordinal data as categorical, but it is (potentially) not taking advantage of information you successfully gathered.

To explain how this could happen, we need to think like a computer for a bit. Imagine you are studying the effects of a light treatment on plant height. You run an ANOVA (Plant height vs. Treatment), with the light treatment having four categorical levels: “low,” “medium,” “high,” and “very high.” An ANOVA is, at its core, a linear regression: Your stats program will take your data, attempt to plot them on a numeric plane, and then try to fit a line that gets as close to all the points as possible. That means it needs to assign numbers to your data to squish them into a coordinate plane, even if your data aren’t numeric.

With categorical data, this is a tricky proposition, right? There are no innate numbers we should assign to categorical data, by definition. So, our stats program does the best thing it can do—it codes the data. The default way it does this is by making one category the “reference” category (let’s say it makes “low” the reference for this example). Then, it makes a series of “dummy variables,” which are easier to show than to explain. In this case, the dummy variables would essentially be: “Is this observation ‘medium’?”, “Is this observation ‘high’?”, and “Is this observation ‘very high’?” Then, for each observation, it puts a 1 wherever the answer to the corresponding question is “yes” and a 0 everywhere else. So, an observation from the “low” treatment would have a value of 0, 0, and 0 for these three dummy variables, whereas every other observation would have a 1 somewhere and 0s everywhere else.
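
If you want to see this coding with your own eyes, R’s model.matrix() function will show you the dummy variables a model would build. Here’s a minimal sketch (one made-up observation per treatment level, just for illustration):

Trt = factor(c("low", "medium", "high", "very high"), levels=c("low", "medium", "high", "very high")) #Four example observations, one per level
model.matrix(~Trt) #A column of 1s (the intercept) plus a 0/1 "dummy" column for each non-reference level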

Now, the stats program can draw its lines successfully because the categorical data have been squashed by merciless force into a “numerical” form. So, we might get output that looks like this:

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             1.3611     0.7776    1.75 0.082239
Treatment: Medium       3.7607     1.0783   3.487 0.000652
Treatment: High         4.3889     1.1335   3.872 0.000165
Treatment: Very High   -0.3544     1.0922  -0.324 0.746096

Maybe you’ve gotten output like this before. You’ll notice we got three separate coefficient estimates, one each for the medium, high, and very high treatments. Where’s low? It’s actually the intercept value here! Essentially, the model is saying that if Treatment was at its reference value (i.e., low), where all three dummy variables would be equal to 0, the mean plant height would be 1.36. The other three estimates are how much plant height would change, on average, from that intercept value as we move from low to each of the other treatments. In other words, plant height would increase by 3.76 and 4.39 as we move from low to medium or from low to high, respectively, but it would actually go down (non-significantly) by 0.35 from low to very high.
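
A quick way to convince yourself of that interpretation is to add the intercept to each estimate, which gives each treatment’s predicted mean height. Here’s that arithmetic using the rounded values from the output above (in practice, you’d pull these numbers from coef() on your fitted model rather than typing them in):

1.3611 + 3.7607 #Predicted mean height for "medium": about 5.12
1.3611 + 4.3889 #Predicted mean height for "high": about 5.75
1.3611 - 0.3544 #Predicted mean height for "very high": about 1.01, barely different from "low" (1.36)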

This tells us something about how treatment is related to plant height, to be sure. But there is something we all know (“high” light is more light than “medium” light) that the program couldn’t factor into its analysis here because we didn’t let it. As such, the program couldn’t give us one answer (one slope value) to our question of how light levels affect plant growth–it could only give us three partial answers to that question (how each of the other light levels compares with low light).

Many analysts who recognize the potential wastefulness of treating ordinal data in this way respond by treating ordinal data as numerical instead. This is actually not inherently wrong either…so long as you take at least one of three appropriate precautions: 1) Check for and account for any squishiness in the distances between your categories, 2) Use statistical tools with relaxed assumptions, and/or 3) Operationalize your “scale.” It’s when these precautions aren’t taken that you can get into iffy territory.

Turning back to our regression, let’s imagine we just converted all of our treatment data to the numbers 0, 1, 2, and 3 for low, medium, high, and very high, respectively, and ran a linear regression. This time, we might get results like this:

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            3.4188     0.7048   4.851 3.19E-06
Treatment (Numeric)   -0.1043     0.3782  -0.276    0.783
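
If you want to try this version yourself, here’s a minimal sketch that reuses the fake Xcategorical, Xnumeric, and Ydata objects from the code at the end of this post (so the exact numbers won’t necessarily match the output above):

Xnumeric = as.numeric(Xcategorical)-1 #"Low"=0, "Medium"=1, "High"=2, "Very High"=3--evenly spaced, whether that's warranted or not
summary(lm(Ydata~Xnumeric)) #A single slope, which assumes every step between categories is the same "distance"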

This time, our stats program has treated our treatment data as numerical, integer-based data. 1 is exactly and always 1 less than 2, and 2 is always higher than 1. We find no significant linear relationship between plant height and treatment (p = 0.783). This is despite having seen evidence of a relationship before (e.g., low and high were significantly different). Already, we can see that how we treat ordinal data fundamentally affects the results we get and the conclusions we might draw! One reason for this change in outcome is that we have possibly abused the assumptions of numeric space with this model. We’ve assumed, implicitly, that “high” is exactly and always as far away from “very high” as it is from “medium” when it comes to light levels. We’ve also assumed that “high” means always and exactly the same value in every trial of the experiment. In other words, if “high” is 2, then it’s always and exactly 2 and thus “medium” and “very high” are always and exactly 1 and 3. This *probably* isn’t true. If you have 3 different research students deciding if light levels are high that day, they may not all agree on what “high” light levels look like. You might also not agree with yourself about this from day to day! If these categories are really squishy in their relative location along our light “axis,” then we could be in trouble. What if “very high” is nearly twice as much light, on average, as “high,” but “high” is only 10% more light than “medium,” on average? This would mean that, even though we are treating “medium,” “high,” and “very high” as being equal distances from each other, they’re waaaaay not. Really, they should be more like 1, 1.1, and 1.9, not 1, 2, and 3! That means that our stats program will end up putting these data in the wrong “places” on our graph and then fit a line to them based on that imperfect placement–it’s unlikely to give us a meaningful answer when we do that!

In other words, we need to acknowledge that the distances between our treatment levels are (or at least could be) “squishy.” In particular, we need to consider whether some levels are actually much closer together than others. The most straightforward way to do this is to see if the relationship between your treatment levels and plant height is non-linear. Non-linear functions can “bend” through squishy space more flexibly than a line can. This is what programs like R do automatically when you put data you’ve classified as ordinal (e.g., by declaring them an ordered factor) into a regression model.

              Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.2514     0.7567   1.654      0.1
Treatment       6.2844      1.225    5.13 9.40E-07
Treatment^2    -2.1101     0.3882  -5.436 2.34E-07

In this case, R has added in a quadratic term (a squared term) for treatment in addition to the normal linear term. Both terms come back significant. Thus, as treatment increases from low to medium to high to very high, plant height increases (by 6.28 per category moved) but at a decreasing rate (the quadratic term subtracts 2.11 times the square of the number of categories moved). This explains why, in our original ANOVA, Low and Very High were not different from one another—Very High light, in our case, seems to be too high and actually stunts growth, whereas there is at least some increase in height from low to medium to high.
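
To see what that curve implies, you can plug the fitted coefficients back in at each treatment level (assuming, as in the code at the end of this post, that the levels are coded 0 through 3). A minimal sketch using the rounded values from the output above:

x = 0:3 #Low, Medium, High, Very High
1.2514 + 6.2844*x - 2.1101*x^2 #Predicted heights: roughly 1.25, 5.43, 5.38, and 1.11--up, then back down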

So, allowing “bend” in a regression model involving ordinal variables via power terms is one way to handle these types of data. A second way is to use statistical tools, like bootstrapping, that don’t make such hard and fast assumptions about things like distance in your data to begin with. However, there is a third way that I like: Try to estimate the “real” distance between your ordinal categories and then code those data accordingly. In the work I describe here, I defoliated plants by differing amounts to see how that would affect their reproduction. My defoliation treatment had three levels: None, Partial, and Near-total. A classic ordinal variable, right? However, rather than model these data as ordinal by including quadratic terms in my models, I went back to my field notes. Luckily for me, I had bagged and weighed all the remaining leaves my team had collected from my study plots at the end of the year. It turned out that plants in my partial defoliation plots still had 67% more leaves after defoliation than those in my near-total defoliation plots. So, rather than treat my near-total defoliation plots as 0, partial defoliation plots as 1, and no defoliation plots as 2, I had data to indicate my partial defoliation plots were actually more like my no defoliation plots, so I could “shift” those data over a little bit by coding these data as 0, 0.67, and 1 instead. That way, fitting a line to them would be more accurate!
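
In code, that kind of recoding can be as simple as a named lookup. Here’s a minimal sketch with made-up plot assignments (the 0, 0.67, and 1 values are the ones I describe above):

Defoliation = c("None", "Partial", "Near-total", "Partial") #A few made-up plot-level treatment assignments
DefoliationCoded = c("Near-total"=0, "Partial"=0.67, "None"=1)[Defoliation] #Swap each category for its estimated position on a 0-1 scale
DefoliationCoded #Now 1, 0.67, 0, and 0.67 instead of 2, 1, 0, and 1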

If you know this step may be useful going into a project, it is often not hard to get a rough estimate of where, on a more exact “scale,” your ordinal data will roughly fit. For example, if you plan to use Likert data (i.e., strongly disagree to strongly agree), you can ask during the same survey something like, “On a scale of 0 to 100, with 0 being ‘strongly disagree’ and 100 being ‘strongly agree,’ what number do you associate with ‘agree’?” For me, I would say like 60. For you, it might be 80, and for your dog, it might be 85. If I have these data, I can choose to either treat “agree” as the average of what all respondents think “agree” is OR I can even just replace “agree” with the corresponding number from each respondent every time they choose “agree,” thus converting ordinal data into (more or less) numerical data without much extra hassle!
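
Here’s a minimal sketch of what that swap might look like, with made-up survey answers (all object names are hypothetical):

Responses = c("agree", "neutral", "agree", "strongly agree") #Each respondent's Likert answer to some question
AgreeCalibration = c(60, 75, 80, 90) #Each respondent's answer to "what number (0-100) is 'agree' to you?"
ifelse(Responses=="agree", AgreeCalibration, NA) #Swap in person-specific values for "agree" (other levels would need their own calibration questions)
mean(AgreeCalibration) #Or use the average (about 76) as a single stand-in value for "agree" across all respondents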

So, do you now feel comfortable with ordinal data? What did I fail to explain about them that you’d still like to know? What experiences do you have with these types of data? Let me know! Some R code to generate some categorical, ordinal, and numeric data is provided below in case you want to explore a bit!

set.seed(150) #Controls randomness in R

#Create some fake data
Xcategorical = factor(sample(c("Low","Medium","High", "Very High"), 144, replace=T), levels=c("Low", "Medium", "High", "Very High"))
Xnumeric = as.numeric(Xcategorical)-1 #Recode the four levels as the numbers 0 through 3
Ydata = 0.5 + (7*Xnumeric) + (-2.25*(Xnumeric^2)) + round(rnorm(144, 0, 5)) #Simulate a response with a built-in "bend" (quadratic) plus random noise

#Modeling the same data with the X variable treated as categorical, numeric, and ordinal
summary(lm(Ydata~Xcategorical)) #Categorical: dummy coding relative to the reference level ("Low")
summary(lm(Ydata~poly(Xnumeric, 2, raw=T))) #Numeric: linear plus quadratic (squared) terms
summary(lm(Ydata~factor(Xnumeric, ordered=T))) #Ordinal: R applies polynomial contrasts automatically

plot(Ydata~Xcategorical) #With a factor on the x-axis, this produces boxplots of Y for each treatment level

For more information, check out this other resource I found helpful!