Size & Power Series IV: A Confidence (Interval) Scheme

Welcome to Article IV in this 5-part article series! You can find the other articles in this series by clicking the links below!

In the last article in this series, we played around with simulations, and we discovered some interesting truths about how samples taken from populations behave. We learned that shrinking a problem down to a sample is helpful but that it comes with its own challenges: It’s much easier to get an answer to our question by asking it of a small sample than of the entire population, but it’s also much more likely the answer will be wrong (most likely by just a little bit, but potentially by a lot), and we might never know how wrong (or right) it really is! That’s why we need Confidence Intervals, the subjects of this article.

Let’s start by imagining a scenario: Your friend is decent at darts. She throws a single dart at a dartboard hanging on a wall. Then, she takes the dartboard down, and she marks where on the wall the dart would have landed (whether it had actually hit the dartboard or not!). She then brings you into the room and gives you the following challenge: “Based on my one dart throw, draw a circle on the wall that includes the bullseye of the dartboard I was aiming at.”

Relatable.

The scenario I’ve just outlined–though probably very weird-sounding!–is super similar to the one faced by statisticians when trying to use one statistic (one dart throw) constructed from a decent sample (the throw was likely to be decent) to make an informed guess about the truth (where the bullseye was). I know the dart more likely landed close to the bullseye than far away from it. However, even a decent dart thrower can still miss (even a good sample can land far from the truth), sometimes by a lot!

How should I use this knowledge to draw a “reasonable” circle on the wall? If I draw a tiny circle, just around where the dart had landed, I am probably putting too much faith in my friend’s throwing! She’s decent at throwing darts, but even a decent thrower is going to miss the bullseye more often than she will hit it (i.e., it’s more likely a statistic will be unequal to the parameter than equal to it).

If I instead draw a gigantic circle around where the dart landed, one filling basically the entire wall, I’ve probably erred in the opposite direction. Now, I’ve put way too little faith in my friend’s throwing! I’ve also done something sort of silly: Yes, I’ve likely succeeded in drawing a circle around where the bullseye had been, but only by making it virtually impossible that I’d fail! This is likely so vague a response to the challenge that it raises the question of why I bothered trying in the first place!

Like I always say: Set your expectations low enough and you will never fail to meet them.

There are other bad decisions I could make. For example, unless I have more information than I’m telling you about, it would be unwise of me to ignore my friend’s dart throw entirely and instead put the circle somewhere that doesn’t include her mark. Unless I knew she was purposely trying to miss (she had biased her sample), why should I consider her dart throw so untrustworthy as to be ignorable? She wouldn’t do that…would she?

Similarly, unless I know she tends to miss to the left a lot (she is not representing all subgroups in the population), I have no reason not to put her mark in the center of any circle I draw, no matter how big or small I make it. Without any “insider info,” I have to assume that her mark is the best and most reliable information I have about the location of the bullseye, even if it’s still only so useful (i.e., basing a guess for the parameter on our statistic, though likely imperfect, is still likely better than guessing randomly).

Analogy level: Tortured.

That thought experiment, odd as it was, is an apt analogy for crafting a Confidence Interval! Our sample is the dart throw, our statistic is where that throw lands, and the task of putting the circle on the wall that includes the bullseye is us trying to put a Confidence Interval around a parameter. Whether we’re right or wrong (whether we succeed in including the bullseye in our circle or not) and how right/wrong we end up being depends, in large part, on 1) How “good” our sample was to begin with, 2) How much confidence we then want to put in that sample, and 3) How precise (or not) we are comfortable with our circle being.

That third factor above deserves some unpacking! You know how, when the news talks about an opinion poll, they say something like “According to a new poll, 55% of Americans hate X, plus or minus 3%”? It’s like the pollster is trying to hedge their bets; they seem reluctant to commit to a solid number and instead put a “fudge factor” of 3% around their guess.

By now, I hope you see why–they know their statistic is probably close to the truth, but also probably a bit off. So, they want to be precise enough with their “circle” that it projects confidence–we know something about all Americans!–but not so bold that they put too much faith in their sample and stand too high a chance of missing the true value entirely.

That’s building a Confidence Interval in a nutshell: Putting the “right” (really, the desired) amount of “fudge factor” around a statistic to give a “hedged guess” about the value of a parameter. In fact, the formula for generating a Confidence Interval is, at its simplest, just: Statistic +/- [some fudge factor].
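
To make that concrete, here’s a toy example in R using the made-up poll numbers from earlier (55%, plus or minus 3%):

#A toy example of Statistic +/- [some fudge factor], using the made-up
#poll from earlier: a statistic of 55% with a 3% "fudge factor."
0.55 + 0.03 #Upper end of the hedged guess: 58%
0.55 - 0.03 #Lower end of the hedged guess: 52%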

How big or small a value we put in the right-hand box is ultimately up to us. If we’re feeling like totally noncommittal cowards, we could just put “infinity” in that box (drawing a circle the size of the whole wall). We’d never be wrong, of course, but we’d never feel like we had increased our understanding of the world either!

So that’s really the trade-off–we want to craft a Confidence Interval that is reasonably likely to contain the parameter, even if our sample statistic has missed it by a bit (which it likely has), without making the Interval so large that, on a practical level, we might as well not have bothered!

Ok. Great. But…you know…how do we actually strike that balance between precision and uncertainty successfully? The “easiest” way, in some senses, would just be to go take more samples. As we learned in a previous article in this series, while any one sample is pretty likely to miss the “truth” by a bit, a whole mess of samples will tend to cluster around the “truth.”

If we had 20 dart throws to work with, all splattered randomly around the bullseye, it’s a good bet the bullseye is somewhere in the middle of them, right? [Sidenote: What I’m describing here–“a pile of samples clustering around the truth”–has a name in Stats Land: A sampling distribution, the distribution of statistics we’d get if we took many samples]. This is sort of a “duh” notion, if you think about it: What I’m saying here is just that “more information is better than less.”

The catch here is of course that, often, we can’t get more information. In ecology, one sample of a reasonable size is hard enough to get! If the one sample we have is all we’re going to get, what then?

Apparently, “surrealist memes” are a thing…

Here, a statistician might offer a strange-sounding suggestion: “Can you imagine you’ve taken more samples?” And here we must jump headfirst down a surreal-feeling rabbit hole, but the journey will be worth it, I promise!

As we discussed in an earlier article, to be able to use a statistic to make an informed guess about a parameter in the first place, we have to assume that our sample is representative–it’s just like the population, only mini! If that’s true, then this statement is also true: My sample, copied some number of times X, would look (more or less) exactly like the population. That is, make enough exact replicas of our sample and we’d be able to recreate our population from scratch!

Ok, a little strange a notion, sure, but logical. Now, if that’s true, then this statement is also true: If I did create a perfect replica of my population like this using my sample, I could then draw “new samples” from this replica, and doing that would be the same as taking real new samples from the actual population. That is, we can imagine we’re taking many additional samples, even though we can’t and they aren’t.

Trippy enough for you yet?? “Woah woah woah,” you might say. “How do we know our sample is enough like the population to get away with this?” Simple: We don’t! However, if we can’t safely make the assumption that our sample is reasonably like the population, our sample is probably useless in the first place. So, why not roll the dice and see what happens? And besides, hopefully we took all the steps I outlined earlier to take the most representative sample we could…so we’re probably fine.

So, let’s use R to actually do this “imagined sampling” process. I’ll use the same code (more or less) I used in my previous article: We will draw one sample of 50 stomach aches from a population in which 25% of them are caused by hunger. We will then simulate drawing “entirely new samples” by drawing samples of size 50 from just our sample but with replacement, as though our sample had actually been duplicated an infinite number of times. We’ll then grab the “statistic” from each one and plot them in a histogram, just as we did before.

library(mosaic) #Provides the prop(), do(), and histogram() functions used below

#Draw just the one "true" sample from the population
our_sample = sample(c("Hunger-related", "Not-hunger-related"), 
                    size = 50, 
                    replace = TRUE,
                    prob = c(0.25, 0.75))

#What's the statistic?
(sample_stat = prop(our_sample, 
                    success="Hunger-related"))
prop_Hunger-related 
               0.22 #22%

#Draw 1000 new "samples" from our "true" sample, with replacement!
rand_samples1000_v2 = do(1000)*{
  new_sample = sample(our_sample, #Drawing from our sample...
                      size = 50, #Same size samples.
                      replace = TRUE #With replacement
                      )
  prop(new_sample, success="Hunger-related") #Grab the "statistic" from each "sample."
}

#Here's our histogram of all the different statistics we got.
histogram( ~prop_Hunger.related, data = rand_samples1000_v2)
A histogram of all the simulated “statistics” I got from my “imagined” re-sampling campaign. In my “real” sample, the statistic was 22%, so it was close to the “truth” of 25% but a little off.

There’s so much good stuff to unpack in this one graph that I want us to really sink our teeth into it! Here are some things I want us to come away with, in no particular order:

  • The highest bar in the center of this graph is the one for the bin containing 22%. This makes sense–we were drawing “new” samples using our “old” sample, and in our old sample, 22% of stomach aches were hunger-related. Predictably, our “simulated” samples were generally a lot like the “actual” sample they were taken from.
  • Was our one “real” sample right here? No–it was off from the “truth” of 25% by 3%. Was it wrong by a lot? Also no–back when we drew 1,000 samples from the population [see below], we had some samples that were off by ~25%. Compared to those, being off by just 3% is actually pretty solid!
  • In our simulated re-sampling campaign, we also had some “samples” that were off by a lot–in fact, some of these “samples” had statistics nearly 25% higher/lower than that of the sample they were drawn from! Is it a coincidence that the most extreme “samples” in this exercise were off by about as much as the most extreme “real” samples were in our real re-sampling campaign? No.
  • Even though we based this whole exercise on a single sample, and any given sample has the potential to be way off from the truth, was the truth (25%) nevertheless in here somewhere? Yes, and it’s actually relatively close to the middle. If we “drew a circle” around the middle of the x-axis of our graph above, it’d only need to go ~3% on either side of the center to include our parameter. And, if we were comfortable giving ourselves a circle with an even wider “radius”–say, 12%–even a relatively oddball sample, like one with a statistic of 37%, would have still yielded a circle containing the parameter.
  • Notice the shape both histograms we’ve made in this series so far have taken: A bell shape, with a big hump in the middle, falling off steeply to either side. The hump occurs where values are common–as we’ve seen, more often than not, samples tend to cluster around the truth, so the peak of the hump is around the truth. Meanwhile, the skinny tails occur where values are uncommon–every now and then, we draw a sample that is pretty kooky (the world is a random place!), but so long as our samples are representative, this should happen infrequently. The bell shape we’re seeing here is so, well, normal that we call it a Normal Curve. The Normal Curve–and its ubiquity–is the Central Limit Theorem in action: the statistics from our samples pile up in this bell shape around the truth, and the more information we have, the more tightly around the truth they will tend to cluster.
Our histogram from our 1,000 random samples from earlier, reprinted here for reference. Compare it to our “simulated” re-sampling effort above!

So, as you can see, our “imagined” re-sampling process yielded virtually the same result (histogram) as our “real” re-sampling process in our previous article did, even if it seemed like a strange thing to do! [Sidenote: What I’m describing here–“a pile of ‘imagined’ samples, taken from a single sample, clustering around/near the truth”–has a name in Stats Land: A bootstrap distribution. That’s because we used one sample to “pull ourselves up by our own bootstraps.”]

How does all this help us build a Confidence Interval though?? Patience–we’re nearly there! There’s just one more key idea to unlock.

I want you to look at both the histograms above again. On each, we know what the truth really was: 25% for our “real” resampling and 22% for our “imagined” resampling. We see that while many samples are equal to the truth or nearly so, most were off by a little bit, and some were off by a lot. If we think about these graphs a little differently, they show us not just what the truth likely is but also how far off from the truth we are likely to be on any one go.

Let me show you what I mean. Below, I will show our sampling distribution again from our “real” re-sampling. This time, though, I’ll subtract the truth (25%) from all our statistics.

#Our sampling distribution again, but with the truth (0.25) subtracted from every statistic.
#(rand_samples1000 holds the 1,000 "real" samples we drew in the previous article.)
histogram( ~prop_Hunger.related-0.25, data = rand_samples1000)
Notice that now the center of this graph is 0–the average sample was exactly equal to the truth or nearly so, and X – X = 0.

Now, the x-axis shows us how far off each statistic was from the truth and in which direction. For example, a value of -0.15 means a statistic that was 15% lower than the truth. The difference between a statistic and the truth is called a deviation. Now that we have all the deviations, we can ask an important and super-useful question: What was the average deviation? That is, how far off from the truth was the average sample? How far away from the bullseye should the average dart throw be??

This is actually not as simple a value to calculate as it sounds because there are negatives and positives in there. If we try to add up all the deviations, as we would in a “normal mean,” the negatives and positives would just cancel each other out. There are other solutions to this problem (including the one I actually find more intuitive, which involves absolute values), but the one we’ll use involves squaring our deviations to get rid of the negatives, then taking the mean as normal, then undoing the squares with a square root.
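
To see that cancellation problem for ourselves, we can take the plain mean of all those deviations:

#The plain mean of the deviations: the positives and negatives wipe each
#other out, landing us very close to 0 and telling us nothing about the
#typical size of a miss.
mean(rand_samples1000$prop_Hunger.related-0.25)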

#What was the average deviation? 
#Subtract the truth (0.25) from each statistic, square the differences to get rid of negatives, take the mean, then undo the squares with a square root. 
sqrt(mean((rand_samples1000$prop_Hunger.related-0.25)^2))
[1] 0.06160519

So, the average sample statistic was about 6.16% off from the truth. This means it was not at all uncommon to get a statistic of 19% or 31% even when taking a good sample and even though the truth was really 25%! Such is the nature of taking samples–the world is a random place!

This average deviation may seem like an interesting but otherwise unimportant number, but it’s not! Thinking back to our darts example, imagine you had not just the mark on the wall to work with but also data on how many inches your friend tended to miss the bullseye by, on average. Can you imagine being able to use this information to draw a much better circle? I would think so!

This value is so important, and this is such a standard way of calculating it, that it has a name in Stats Land: the standard deviation, because it’s the standard (typical, average, etc.) amount a sample misses by. But it’s not just any standard deviation: It’s a standard deviation specifically describing the size of the expected gap between our sample’s statistic and the truth. If we consider this gap to be our sample’s “error,” then it makes sense why this particular standard deviation is called the standard error.
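
If you wanted to bundle that calculation up for reuse, a tiny helper function might look like the (hypothetical) one below; it just repeats the square-then-mean-then-root recipe from above:

#A hypothetical helper: given a pile of statistics and the value they
#were "aiming" for, return the standard error (the typical miss).
standard_error = function(stats, truth) {
  sqrt(mean((stats - truth)^2))
}

standard_error(rand_samples1000$prop_Hunger.related, truth = 0.25) #Matches the ~6.16% from above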

Now, I should say that thinking of this value as “a normal average” can be a little misleading. Let’s say you assumed every sample you took was always less than 1 standard error away from the truth. You’d think, given that a standard error is the average amount a sample misses the truth by, your assumption would be wrong ~half the time.

But that isn’t the case–your assumption would actually be right more than half the time, and it’s thanks to the Normal Curve shape the CLT gives us. Take a look at the histogram above once more. Start at the truth and move up 6%, our standard error. Now, start again at the truth and move down 6%. All the bars you encountered are bars for samples that missed the truth by 1 standard error or less, and they’re all the highest bars, right? In reality, the majority of samples missed by 1 standard error or less. That’s because the standard error, like any average, gets pulled upward by the occasional big miss, so the “typical” sample actually misses by less than the standard error.

In fact, on a perfect Normal Curve, there is a very predictable relationship between a number of standard errors and the % of samples within that many standard errors of the truth:

Roughly 68% of samples will likely be within 1 standard error of the truth, 95% will be within 2 standard errors, and > 99% will be within 3 standard errors. Lowercase sigma (σ) is the symbol for a standard deviation. Photo Credit: Wayne W. LaMorte
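
We don’t have to take that figure’s word for it, either; base R’s pnorm() function reports the fraction of a perfect Normal Curve falling below any given number of standard deviations from the center, so we can check those percentages ourselves:

#Fraction of a perfect Normal Curve within 1, 2, and 3 standard deviations of its center.
pnorm(1) - pnorm(-1) #~0.68, or ~68%
pnorm(2) - pnorm(-2) #~0.95, or ~95%
pnorm(3) - pnorm(-3) #~0.997, or >99%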

If we calculate the % of our samples that were within 1 standard error of the truth, which I do below, it likely will be pretty close to that 68%, even if it’s not exactly 68% because our distribution isn’t a perfect Normal Curve.

#What fraction of deviations were < 1 se?
prop(rand_samples1000$prop_Hunger.related-0.25 < 0.0616 & 
       rand_samples1000$prop_Hunger.related-0.25 > -0.0616)
prop_TRUE 
    0.657 #Close enough!

So, that’s interesting. If we assume any one sample we take is probably pretty “standard” and thus within 1 standard error of the truth, we’d be right about 68% of the time!

[Sidenote: Notice the very careful and tortured way I just said that last sentence! Notice that I did NOT say “You’d have a 68% chance of being right for any given sample.” I didn’t say that because it’d be false. Consider: A sample will either be within 1 standard error or it won’t be. Once we’ve taken that sample, it’s 100% certain that it is in one group or the other (it’s like whether a flipped coin is heads or tails). So, the 68% bit only works when we’re thinking about a whole group of samples–68% of them, on average, will fall one way and the other 32% the other!]

However, let’s make it more interesting by doing the same exercise on the “simulated” re-samples we collected earlier, adjusting the “truth” in our calculation to 0.22, since that’s what it was in our one “true” sample we used to make our resamples.

#Standard error for our simulated resampling campaign.
sqrt(mean((rand_samples1000_v2$prop_Hunger.related-0.22)^2))
[1] 0.05915742

Now, 5.916% is not the same as 6.161%, but it’s pretty close, right? Even though we had to make some strange-sounding assumptions to do this “imagined resampling,” and even though the sample we did it with was a bit off from the population, the process gave us a VERY similar estimate for how much we can expect the average sample to deviate from the truth by!

We’ve come a long way, but we’re finally here. And by “here,” I mean to the math, but don’t worry, you’re ready! Check out what happens when I multiply our truth (0.25) by one minus our truth (0.75), divide that by 50 (our sample size), then take the square root:

> sqrt((0.25*0.75)/50)
[1] 0.06123724

Again, 6.124% is not 5.916% or 6.161%, but it’s quite close, right? So long as we can assume the histograms we would have made would look like Normal Curves, the relationships between Normal Curves and standard errors, as well as those between standard errors and sample sizes [larger samples mean less variability between samples thanks to more information and thus smaller standard errors], are so predictable that we can just do math to estimate our standard errors without even imagining taking multiple samples! To be clear–we can use our one sample to guess not only what a parameter might be but also how much fudge factor we will need to put around our guess to be reasonably sure we’re not wrong, and we can do that because we can use math to make a good estimate as to how far off from the truth our guess was likely to be!
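
Of course, in practice, we wouldn’t know the truth of 25%; that’s the whole thing we’re trying to estimate! Instead, we’d plug our sample’s statistic (22%) into that same bit of math. Here’s a quick sketch of what that looks like:

#The same math, but using our sample statistic (0.22) in place of the
#truth (0.25), since the truth is exactly what we don't know.
sqrt((0.22*(1 - 0.22))/50) #~0.0586, or ~5.86%

Still right in the same ballpark as our other estimates, which is exactly the point.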

What we have discovered, in this article, is that A) We can get an estimate of our standard error a few different ways, and B) We can then use that standard error to make an educated guess as to how far off our statistic might be from the truth. This means, to make our Confidence Interval, we just need to decide: “How liberal or conservative do I want to be?” Since we construct our CI by doing: statistic +/- [some value], and we know that a standard error of X corresponds to Y % of samples falling within X standard errors of the truth, we can 1) pick a value for Y we’re comfy with, 2) figure out what X must then be, and then 3) shove X in the box!
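
Step 2 is where R’s qnorm() function comes in handy: give it the fraction of a Normal Curve you want to fall below a point, and it reports how many standard deviations out that point sits. Here’s a quick sketch for two common confidence levels:

#To capture the middle Y% of a Normal Curve, we leave (100-Y)/2 % in each tail.
#For Y = 80%, that's 10% per tail, so we ask for the 90th percentile:
qnorm(0.90) #~1.28 standard errors
#For Y = 95%, that's 2.5% per tail:
qnorm(0.975) #~1.96 standard errors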

Let’s say we wanted to be “80% confident.” That is, for 80% of the samples of size 50 we take (sample size matters!), we’d expect our CI to contain the parameter [it’s fair to think about this as a “long-run probability”]. Turns out that 80% of values in a Normal Curve are within 1.28 standard deviations of the center, so for our one sample with a statistic of 0.22, our Confidence Interval would be:

> 0.22 + 1.28 * 0.05916 #Using our simulated SE -- upper bound
[1] 0.2957248
> 0.22 - 1.28 * 0.05916 #Using our simulated SE -- lower bound
[1] 0.1442752

Or 14.4% to 29.6%. We took just one sample of just 50 from an infinitely large population, and that sample wasn’t even super close to the truth. We then just imagined our sample was the population, only smaller, and simulated taking samples from it to get an estimate of the error of a typical sample. We then built a Confidence Interval using that value and our statistic. So many places we could have gone awry, and yet, the result worked–our interval contains our parameter (25%), comfortably even!

Now, the next sample may not be so lucky. Or maybe it’ll be the one after that one that isn’t lucky. With 80% Confidence, we’ll “fail” with our CI ~20% of the time, on average, but 4 out of 5 ain’t bad! And, usually, scientists prefer to be painfully conservative; 95% Confidence is typical, so while our Intervals tend to be quite wide (perhaps sometimes uselessly so), they fail relatively infrequently to corral the truth. Walk down this path enough times (i.e., replicate the same study many times), and even a conservative process will eventually pinpoint the truth!
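
For the curious, here’s a quick sketch of what that more conservative 95% Interval would have looked like for our same sample, reusing our simulated standard error and the ~1.96 multiplier we found with qnorm() earlier:

#The 95% Confidence Interval for our sample: wider, but "wrong" less often.
0.22 + 1.96 * 0.05916 #Upper bound: ~33.6%
0.22 - 1.96 * 0.05916 #Lower bound: ~10.4%

Notice how much wider it is than our 80% Interval; that extra width is the price of the extra confidence.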

That concludes this article–the second-to-last one in this series. In the next, and last, article, we’ll see how we can use the logic of Confidence Intervals to make an educated guess, not about the truth, but about the size of the sample (maybe) needed to support our notions about the truth. We’ll also talk about why this approach, while useful, can also be risky or misleading. You can find that last article here!
