*Welcome to Article IV in this 5-part article series! You can find the other articles in this series by clicking the links below!*

- Size & Power Series I: How Stats Logic is Human Logic
- Size & Power Series II: Walk Like a Statistician
- Size & Power Series III: Living in a Simulation
- Size & Power Series V: The Power and the Glory (and the Sample Size)

In the last article in this series, we played around with simulations, and we discovered some interesting truths about how samples, taken from populations, work. We learned that shrinking a problem down in this way is helpful but that it comes with its own challenges: It’s much easier to get an answer to our question by asking it at the level of a small sample versus at the level of the entire population, but it’s also much more likely the answer will be wrong–most likely by just a little bit, but potentially by a lot–and we might never know how wrong (or right) it really is! That’s why we need **Confidence Intervals**, the subject of this article.

Let’s start by imagining a scenario: Your friend is decent at darts. She throws a single dart at a dartboard hanging on a wall. Then, she takes the dartboard down, and she marks where on the wall the dart landed (whether it had actually hit the dartboard or not!). She then brings you into the room and gives you the following challenge: “*Based on my one dart throw, draw a circle on the wall that includes the bullseye of the dartboard I was aiming at.*”

The scenario I’ve just outlined–though probably *very* weird-sounding!–is **super** similar to the one faced by statisticians when trying to use one statistic (one dart throw) constructed using a decent sample (the throw was likely to be decent) to make an informed guess about the truth (where the bullseye was). I know the dart was more likely close to the bullseye than far away from it. However, I know even a decent dart thrower can still miss (even a good sample can land far from the truth), even by a lot!

How should I use this knowledge to draw a “reasonable” circle on the wall? If I draw a tiny circle, just around where the dart had landed, I am *probably* putting too much faith in my friend’s throwing! She’s *decent* at throwing darts, but even a decent thrower is going to miss the bullseye more often than she will hit it (i.e., it’s more likely a statistic will be unequal to the parameter than equal to it).

If I instead draw a *gigantic* circle around where the dart landed, one filling basically the entire wall, I’ve probably erred in the opposite direction. Now, I’ve put way too *little* faith in my friend’s throwing! I’ve also done something sort of silly: Yes, I’ve likely succeeded in drawing a circle around where the bullseye had been, *but only by making it virtually impossible that I’d fail*! This is likely *so* vague a response to the challenge as to raise the question of why I bothered trying it in the first place!

There are other bad decisions I could make. For example, unless I have more information than I’m telling you about, it would be unwise of me to ignore my friend’s dart throw entirely and instead put the circle somewhere that doesn’t include her mark. Unless I knew she was *purposely* trying to miss (she had biased her sample), why should I consider her dart throw *so* untrustworthy as to be ignorable? She wouldn’t do that…would she?

Similarly, unless I know she tends to miss to the left a lot (she is not representing all subgroups in the population), I have no reason not to put her mark in the **center** of any circle I draw, no matter how big or small I make it. Without any “insider info,” I have to assume that her mark is *the best and most reliable information I have about the location of the bullseye*, even if it’s still only so useful (i.e., basing a guess for the parameter on our statistic, though likely imperfect, is still likely better than guessing randomly).

That thought experiment, odd as it was, is an apt analogy for crafting a Confidence Interval! Our sample is the dart throw, our statistic is where that throw lands, and the task of putting the circle on the wall that includes the bullseye is us trying to put a Confidence Interval around a parameter. Whether we’re right or wrong (whether we succeed in including the bullseye in our circle or not) and how right/wrong we end up being depends, in large part, on 1) How “good” our sample was to begin with, 2) How much confidence we then want to put in that sample, and 3) How precise (or not) we are comfortable with our circle being.

That third factor above deserves some unpacking! You know how, when the news talks about an opinion poll, they say something like “According to a new poll, 55% of Americans hate X, plus or minus 3%”? It’s like the pollster is trying to hedge their bets; they seem reluctant to commit to a solid number and instead put a “fudge factor” of 3% around their guess.

By now, I hope you see why–*they know their statistic is probably close to the truth, but also probably a bit off.* So, they want to be precise *enough* with their “circle” that it projects confidence–we know *something* about *all* Americans!–but not *so* bold that they put *too* much faith in their sample and stand too high a chance of missing the true value entirely.

That’s building a Confidence Interval in a nutshell: Putting the “right” (really, the *desired*) amount of “fudge factor” around a statistic to give a “hedged guess” about the value of a parameter. In fact, the formula for generating a Confidence Interval is, at its simplest, just: Statistic +/- [some fudge factor].
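To make that formula concrete, here’s a minimal base-R sketch (the numbers are just the hypothetical poll figures from above, not output from any real poll):

```r
#A hedged guess: a statistic plus/minus a "fudge factor"
statistic = 0.55     #e.g., 55% of poll respondents hate X
fudge_factor = 0.03  #e.g., the pollster's "plus or minus 3%"

#The resulting Confidence Interval: c(lower, upper)
ci = c(statistic - fudge_factor, statistic + fudge_factor)
ci #0.52 to 0.58
```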

**How big or small a value we put in the right-hand box is ultimately up to us.** If we’re feeling like totally noncommittal cowards, we could just put “infinity” in that box (drawing a circle the size of the whole wall). We’d never be *wrong*, of course, but we’d *never* feel like we had increased our understanding of the world either!

So that’s really the trade-off–**we want to craft a Confidence Interval that is reasonably likely to contain the parameter, even if our sample statistic has missed it by a bit (which it likely has), without making the Interval *so* large that, on a *practical* level, we might as well not have bothered!**

Ok. Great. But…you know…how do we actually strike that balance between precision and uncertainty successfully? The “easiest” way, in some senses, would just be to *go take more samples*. As we learned in a previous article in this series, while any one sample is pretty likely to miss the “truth” by a bit, a whole mess of samples will tend to cluster around the “truth.”

If we had 20 dart throws to work with, all splattered randomly around the bullseye, it’s a good bet the bullseye is somewhere in the middle of them, right? [Sidenote: What I’m describing here–“a pile of samples clustering around the truth”–has a name in Stats Land: A **sampling distribution**, the *distribution* of statistics we’d get if we took many *samples*]. This is sort of a “duh” notion, if you think about it: What I’m saying here is just that “*more information is better than less*.”
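If you’d like to see that clustering for yourself, here’s a quick sketch in base R (no mosaic functions needed; the population and sample size mirror the stomach-ache example from the last article):

```r
set.seed(123) #For reproducibility

truth = 0.25 #25% of stomach aches are hunger-related
n = 50       #Our usual sample size

#Take 1000 samples and record each one's statistic (the sample proportion)
stats = replicate(1000, mean(rbinom(n, size = 1, prob = truth)))

#Any one statistic may miss the truth, but the whole pile clusters around it.
mean(stats)  #Very close to 0.25
range(stats) #...even though individual samples can stray pretty far!
```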

The catch here is of course that, *often, we can’t get more information*. In ecology, one sample of a reasonable size is hard **enough** to get! If the **one** sample we have is *all* we’re going to get, what then?

Here, a statistician might offer a strange-sounding suggestion: “Can you *imagine* you’ve taken more samples?” And here we must jump headfirst down a surreal-feeling rabbit hole, but the journey will be worth it, I promise!

As we discussed in an earlier article, to be able to use a statistic to make an informed guess about a parameter in the first place, we have to assume that our sample is **representative**–it’s just like the population, only mini! If *that’s* true, then this statement is *also* true: *My sample, copied some number of times X, would look (more or less) exactly like the population*. That is, make *enough* exact replicas of our sample and we’d be able to recreate our population from scratch!

Ok, a little strange a notion, sure, but logical. Now, if that’s true, then this statement is also true: *If I did create a perfect replica of my population like this using my sample, I could then draw “new samples” from this replica, and doing that would be the same as taking real new samples from the actual population*. That is, we can *imagine* we’re taking many additional samples, even though we can’t and they aren’t.
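If that equivalence feels too magical to trust, we can check it directly in base R: sampling *with* replacement from our one sample gives (essentially) the same results as physically copying that sample many times and sampling *without* replacement from the copies. (The sample below is a hypothetical one, built to have our 22% statistic baked in.)

```r
set.seed(42) #For reproducibility

#A hypothetical sample of 50 stomach aches, 22% of them hunger-related
our_sample = c(rep("Hunger-related", 11), rep("Not-hunger-related", 39))

#Route 1: Sample WITH replacement from the sample itself.
route1 = replicate(2000,
                   mean(sample(our_sample, 50, replace = TRUE) == "Hunger-related"))

#Route 2: Physically replicate the sample 1000 times, then sample WITHOUT replacement.
replica_population = rep(our_sample, times = 1000) #50,000 "stomach aches"
route2 = replicate(2000,
                   mean(sample(replica_population, 50) == "Hunger-related"))

#The two routes yield (essentially) the same distribution of statistics.
mean(route1) #~0.22
mean(route2) #~0.22
```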

Trippy enough for you yet?? “*Woah woah woah,*” you might say! “*How do we know our sample is enough like the population to get away with this?*” Simple: *We don’t*! **However, if we can’t safely make the assumption that our sample is reasonably like the population, our sample is probably useless in the first place**. So, why not roll the dice and see what happens? And besides, hopefully we took all the steps I outlined earlier to take the most representative sample we could…so we’re probably fine.

So, let’s use R to actually do this “imagined sampling” process. I’ll use the same code (more or less) I used in my previous article: We will draw one sample of 50 stomach aches from a population in which 25% of them are caused by hunger. We will then simulate drawing “entirely new samples” by drawing samples of size 50 from just our sample but with replacement, as though our sample had actually been duplicated an infinite number of times. We’ll then grab the “statistic” from each one and plot them in a histogram, just as we did before.

```
#Load the mosaic package (for prop(), do(), and histogram())
library(mosaic)

#Draw just the one "true" sample from the population
our_sample = sample(c("Hunger-related", "Not-hunger-related"),
                    size = 50,
                    replace = TRUE,
                    prob = c(0.25, 0.75))

#What's the statistic?
(sample_stat = prop(our_sample,
                    success = "Hunger-related"))
prop_Hunger-related
               0.22 #22%

#Draw 1000 new "samples" from our "true" sample, with replacement!
rand_samples1000_v2 = do(1000)*{
  new_sample = sample(our_sample, #Drawing from our sample...
                      size = 50,  #Same-size samples.
                      replace = TRUE) #With replacement!
  prop(new_sample, success = "Hunger-related") #Grab the "statistic" from each "sample."
}

#Here's our histogram of all the different statistics we got.
histogram( ~prop_Hunger.related, data = rand_samples1000_v2)
```

There’s so much good stuff to unpack in this one graph that I want us to *really* sink our teeth into it! Here are some things I want us to come away with, in no particular order:


- The highest bar in the center of this graph is the one for the bin containing 22%. *This makes sense*–we were drawing “new” samples using our “old” sample, and in our old sample, 22% of stomach aches were hunger-related. Predictably, our “simulated” samples were generally a lot like the “actual” sample they were taken from.
- Was our one “real” sample right here? *No*–it was off from the “truth” of 25% by 3%. Was it wrong by a lot? *Also no*–back when we drew 1,000 samples from the population [see below], we had some samples that were off by ~25%. Compared to those, being off by just 3% is actually pretty solid!
- In our simulated re-sampling campaign, we also had some “samples” that were off by a lot–in fact, some of these “samples” had statistics almost ~25% higher/lower than that of the sample they were drawn from! Is it a coincidence that the most extreme “samples” in this exercise were off by a similar amount as the most extreme “real” samples were in our real re-sampling campaign? *No.*
- Even though we based this whole exercise on a single sample, and any given sample has the *potential* to be way off from the truth, was the truth (25%) nevertheless in here somewhere? Yes, and it’s actually relatively close to the middle. If we “drew a circle” around the middle of the x-axis of our graph above, it’d only need to go ~3% on either side of the center to include our parameter. And, if we were comfortable giving ourselves a circle with an even wider “radius”–say, 12%–even a relatively oddball sample, like one with a statistic of 37%, would have still yielded a circle containing the parameter.
- Notice the shape both histograms we’ve made in this series so far have taken: A bell shape, with a big hump in the middle, falling off steeply to either side. The hump occurs where values are common–as we’ve seen, more often than not, samples tend to cluster around the truth, so the peak of the hump is around the truth. Meanwhile, the skinny tails occur where values are uncommon–every now and then, we draw a sample that is pretty kooky (the world is a random place!), but so long as our samples are representative, this should happen infrequently. The bell shape we’re seeing here is so, well, normal that we call it a **Normal Curve**. **The Normal Curve–and its ubiquity–is a demonstration of the Central Limit Theorem in action–the more information we have, the more tightly around the truth that information will tend to be clustered.**

So, as you can see, our “imagined” re-sampling process yielded *virtually* the same result (histogram) as our “real” re-sampling process in our previous article did, even if it seemed like a strange thing to do! [Sidenote: What I’m describing here–“a pile of ‘imagined’ samples, taken from a single sample, clustering around/near the truth”–has a name in Stats Land: A **bootstrap distribution**. That’s because we used one sample to “pick ourselves up by our own bootstraps.”]

How does all this help us build a Confidence Interval though?? Patience–we’re nearly there! There’s just one more key idea to unlock.

I want you to look at both the histograms above again. On each, we know what the truth really was: 25% for our “real” resampling and 22% for our “imagined” resampling. We see that while *many* samples are equal to the truth or nearly so, *most* were off by a little bit, and some were off by *a lot*. **If we think about these graphs a little differently, they show us not just what the truth likely is but also how far off from the truth we are likely to be on any one go.**

Let me show you what I mean. Below, I will show our sampling distribution again from our “real” re-sampling. This time, though, I’ll subtract the truth (25%) from all our statistics.

```
#Our sampling distribution again, but minus the truth
histogram( ~prop_Hunger.related-0.25, data = rand_samples1000)
```

Now, the x-axis shows us how far off each statistic was from the truth and in which direction. For example, a value of -0.15 means a statistic that was 15% lower than the truth. The difference between a statistic and the truth is called **deviation**. Now that we have all the deviations, we can ask an important and super-useful question: **What was the average deviation?** That is, how far off from the truth was the average sample? How far away from the bullseye should the average dart throw be??

This is actually not as simple a value to calculate as it sounds because there are negatives and positives in there. If we try to add up all the deviations, as we would in a “normal mean,” the negatives and positives would just cancel each other out. There are other solutions to this problem (including the one I actually find more intuitive, which involves absolute values), but the one we’ll use involves squaring our deviations to get rid of the negatives, then taking the mean as normal, then undoing the squares with a square root.

```
#What was the average deviation?
#Subtract the mean from each value, square the difference to get rid of negatives, take the mean, then undo the squares with a square root.
sqrt(mean((rand_samples1000$prop_Hunger.related-0.25)^2))
[1] 0.06160519
```
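For the curious, the absolute-value route I mentioned above looks like this. It’s a self-contained sketch (it simulates fresh statistics with base R rather than reusing the `rand_samples1000` object), but the takeaway transfers: the two routes tell a similar story, with the absolute-value version coming out a touch smaller.

```r
set.seed(7) #For reproducibility

#Freshly simulated sample statistics (stand-ins for rand_samples1000's column)
stats = replicate(1000, mean(rbinom(50, size = 1, prob = 0.25)))

#Route 1 (used above): square the deviations, average them, then un-square
rms_dev = sqrt(mean((stats - 0.25)^2))

#Route 2 (the absolute-value alternative): drop the signs, then average
abs_dev = mean(abs(stats - 0.25))

rms_dev #~0.06
abs_dev #A touch smaller, but a similar story
```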

So, the **average** sample statistic was 6.16% *off* from the truth. This means it was not uncommon *at all* to get a statistic of 19% or 31% even when taking a good sample and even though the truth was really 25%! Such is the nature of taking samples–the world is a random place!

**This average deviation may seem like an interesting but otherwise unimportant number, but it’s not**! Thinking back to our darts example, imagine you had not just the mark on the wall to work with *but also data on how many inches your friend tended to miss the bullseye by, on average*. Can you imagine being able to use this information to draw a **much** better circle? I would think so!

This value is *so* important, and this is such a *standard* way of calculating it, that it has a name in Stats Land: the **standard deviation**, because it’s the *standard* (typical, average, etc.) amount a sample misses by. But it’s not just *any* standard deviation: It’s a standard deviation specifically constituting the size of the expected gap between our sample and the truth. If we consider this gap to be our sample’s “error,” then it makes sense why this *particular* standard deviation is called the **standard error**.

Now, I should say that thinking of this value as “a normal average” can be a little misleading. Let’s say you assumed every sample you took was *always* less than 1 standard error away from the truth. You’d think, given that a standard error is the *average* amount a sample misses the truth by, your assumption would be wrong ~half the time.

*But that isn’t the case*–your assumption would actually be right more than half the time, and it’s thanks to the CLT again. Take a look at the histogram above once more. Start at the truth and move *up* 6%, our standard error. Now, start again at the truth and move *down* 6%. All the bars you encountered are bars for samples that missed the truth by 1 standard error or less. *It’s all the highest bars, right?* In reality, the *majority* of samples missed by 1 standard error or less–not only do good samples tend to be pretty similar to the population, but they can also miss the truth either high *or* low, giving them twice as many opportunities to be within this range.

**In fact, on a perfect Normal Curve, there is a very predictable relationship between a number of standard errors and the % of samples within that many standard errors of the truth: ~68% of samples will fall within 1 standard error of it, ~95% within 2 standard errors, and ~99.7% within 3.**
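Those percentages aren’t folklore; they fall straight out of the Normal Curve’s math. In R, `pnorm()` returns the area under the curve up to a given number of standard errors, so we can compute them ourselves:

```r
#% of values within 1, 2, and 3 standard errors of the center of a perfect Normal Curve
pnorm(1) - pnorm(-1) #~0.68
pnorm(2) - pnorm(-2) #~0.95
pnorm(3) - pnorm(-3) #~0.997
```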

If we calculate the % of our samples that were within 1 standard error of the truth, which I do below, it likely will be *pretty *close to that 68%, even if it’s not *exactly *68% because our distribution isn’t a *perfect *Normal Curve.

```
#What fraction of deviations were < 1 se (0.0616)?
prop(rand_samples1000$prop_Hunger.related-0.25 < 0.0616 &
     rand_samples1000$prop_Hunger.related-0.25 > -0.0616)
prop_TRUE
    0.657 #Close enough!
```

So, that’s interesting. If we assume any one sample we take is *probably* pretty “standard” and thus should be within 1 standard error of the truth, we’d be right about 68% of the time!

[*Sidenote: Notice the very careful and tortured way I just said that last sentence! Notice that I did NOT say “You’d have a 68% chance of being right for any given sample.” I didn’t say that because it’d be false. Consider: A sample will either be within 1 standard error or it won’t be. Once we’ve taken that sample, it’s 100% certain that it is in one group or the other (it’s like whether a flipped coin is heads or tails). So, the 68% bit only works when we’re thinking about a whole group of samples–68% of them, on average, will fall one way and the other 32% the other!*]

However, let’s make it *more* interesting by doing the same exercise on the “simulated” re-samples we collected earlier, adjusting the “truth” in our calculation to 0.22, since that’s what it was in the one “true” sample we used to make our resamples.

```
#Standard error for our simulated resampling campaign.
sqrt(mean((rand_samples1000_v2$prop_Hunger.related-0.22)^2))
[1] 0.05915742
```

Now, 5.916% is **not** the same as 6.161%, but it’s *pretty* close, right? Even though we had to make some strange-sounding assumptions to do this “imagined resampling,” and even though the sample we did it with was a bit off from the population, the process gave us a VERY similar estimate of how much we can expect the average sample to deviate from the truth!

We’ve come a long way, but we’re *finally* here. And by “here,” I mean to the math, but don’t worry, you’re ready! Check out what happens when I multiply our truth (0.25) by one minus our truth (0.75), divide that by 50 (our sample size), then take the square root:

```
> sqrt((0.25*0.75)/50)
[1] 0.06123724
```

Again, 6.124% is not 5.916% or 6.161%, but it’s *quite* close, right? So long as we can assume the histograms we *would* have made *would* look like Normal Curves, the relationships between Normal Curves and standard errors, as well as those between standard errors and sample sizes [larger samples mean less variability between samples thanks to more information and thus smaller standard errors], are *so* predictable that we can just do math to estimate our standard errors without even imagining taking multiple samples! To be clear–we can use our one sample to guess not only what a parameter might be but also how much fudge factor we will need to put around our guess to be reasonably sure we’re not wrong, and we can do that because we can use math to make a good estimate as to how far off from the truth our guess was likely to be!

What we have discovered, in this article, is that A) We can get an estimate of our standard error a few different ways, and B) We can then use that standard error to make an educated guess as to how far off our statistic might be from the truth. This means, to make our Confidence Interval, we just need to decide: *“How liberal or conservative do I want to be?”* Since we construct our CI by doing: statistic +/- [some value], and we know that a distance of X standard errors corresponds to Y% of samples falling within that distance of the truth, we can 1) pick a value for Y we’re comfy with, 2) figure out what X must then be, and then 3) shove X in the box!

Let’s say we wanted to be “80% confident.” That is, for 80% of the samples of size 50 we take (sample size matters!), we’d expect our CI to contain the parameter [it’s fair to think about this as a “long-run probability”]. Turns out that 80% of values in a Normal Curve are within 1.28 standard deviations of the center, so for our one sample with a statistic of 0.22, our Confidence Interval would be:
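That 1.28, by the way, isn’t magic; it’s just the point on a Normal Curve that leaves 10% in each tail (so 80% in the middle), which base R’s `qnorm()` can find for us:

```r
#How many standard errors capture the middle 80% of a Normal Curve?
qnorm(0.90)  #~1.28 -- leaves 10% in the upper tail (and, by symmetry, 10% in the lower)

#And for the scientists' usual 95% Confidence:
qnorm(0.975) #~1.96
```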

```
> 0.22 + 1.28 * 0.05916 #Using our simulated SE -- upper bound
[1] 0.2957248
> 0.22 - 1.28 * 0.05916 #Using our simulated SE -- lower bound
[1] 0.1442752
```

Or 14.4% to 29.6%. We took just one sample of just 50 from an *infinitely large* population, and that sample wasn’t even *super* close to the truth. We then just imagined our sample *was* the population, only smaller, and simulated taking samples from it to get an estimate of the error of a typical sample. We then built a Confidence Interval using that value and our statistic. So many places we could have gone awry, and yet, the result worked–our interval contains our parameter (25%), comfortably even!

Now, the next sample may not be so lucky. Or maybe it’ll be the one after that one that isn’t lucky. With 80% Confidence, we’ll “fail” with our CI ~20% of the time, on average, but 4 out of 5 ain’t bad! And, usually, scientists prefer to be *painfully* conservative; 95% Confidence is typical, so **while our Intervals tend to be quite wide (perhaps sometimes uselessly so), they fail relatively infrequently to corral the truth.** Walk down this path enough times (i.e., replicate the same study many times), and even a conservative process will eventually pinpoint the truth!

That concludes this article–the second-to-last one in this series. In the next, and last, article, we’ll see how we can use the logic of Confidence Intervals to make an educated guess, not about the truth, *but about the size of the sample (maybe) needed to support our notions about the truth*. We’ll also talk about why this approach, while useful, can also be risky or misleading. You can find that last article here!
