Size & Power Series III: Living in a Simulation

Welcome to Article III in this five-part series! You can find the other articles in the series by clicking the links below!

So far, in the first two articles in this series, we’ve laid out the rationale for why–and, at a general level, how–statisticians approach big, challenging questions the way we do. In this article and the next, we’ll get into the specifics of how we do this, using the motivating question introduced in the previous articles: “What percentage of all stomach aches are caused by hunger?” Once we appreciate the specifics, we can manipulate them to explore just how big our sample might need to be for our test to have a good chance of supporting our hypothesis–that’s the topic of Article 5, and it’s where we’re headed!

The most intuitive–and most fun–way to explore the logic and toolkit of a statistician is via simulation. If not for the simulation modeling course I took as a Ph.D. student at UMaine, I may never have fallen in love with programming, and I may not be where I am today! I can’t stress this enough: Simulation modeling is a tool every ecologist would benefit from learning. You can test hypotheses you’d never be able to test in the field, manipulate forces (e.g., the weather) in ways otherwise impossible, synthesize your knowledge of a system to see whether it really works the way you think it does, and so much more!


For our purposes, simulation modeling is also a fantastic way to explore the implications of statistical logic! I’ll switch now to using R, and I’ll be posting both my code and its output below as I go. I’ll simulate us taking a random sample of 50 tummy aches from a HUGE population of them. Here, we can press an advantage that simulation modeling gives us that we don’t normally have in the real world: We know the truth, because we can pick the truth. We know, here, that 25% of all tummy aches are hunger-related!

library(mosaic) #Access some handy statistics simulation tools.
set.seed(475) #Trick to make our randomness more predictable, as odd as that sounds. 

#Take one random sample of 50 subjects from an "infinitely-large" population of tummy aches, in which the true proportion of hunger-related tummy aches is 25%

our_sample = sample(c("Hunger-related", "Not-hunger-related"), 
                     size = 50, 
                     replace = TRUE,
                     prob = c(0.25, 0.75))

#Calculate the statistic from our sample--the proportion of hunger-related tummy aches
(sample_stat = prop(our_sample, 
                    success = "Hunger-related"))

When I ran the code above, I got the following “answer” for my sample:

prop_Hunger-related 
               0.14 

This is interesting, isn’t it? We know the real “answer” here is 25%, but we got an “answer” from our sample of just 14%. This demonstrates key stats idea #1: Even a representative, large (yes, this counts as large!), and random sample is unlikely to produce exactly the “right answer”–more often than not, it will be close to the “truth,” but the world is a random place, so some samples will be far from the truth for that reason and that reason alone! That’s what is happening here.

For fun, I ran the code above a second time to produce a second random sample and get a second sample statistic. I got this value instead:

prop_Hunger-related 
               0.32 

A statistic of 32% this time–that’s also rather far from the “truth,” but in the opposite direction from our first sample! This demonstrates key stats idea #2: Even representative, large, and random samples are unlikely to be exactly like one another. If any two samples are taken randomly from the population, they are going to be randomly different not just from the population but from each other too! They will both tend to be similar to the “truth,” and thus similar to each other, but as these two results have shown, neither of those things is a certainty!
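
If you’d like to watch this sample-to-sample variability happen for yourself, here’s a minimal sketch using the same sample() and prop() calls as above (the names sample_A and sample_B are just placeholders I’ve made up):

#Draw two fresh random samples of 50 from the same population...
sample_A = sample(c("Hunger-related", "Not-hunger-related"), 
                  size = 50, 
                  replace = TRUE,
                  prob = c(0.25, 0.75))
sample_B = sample(c("Hunger-related", "Not-hunger-related"), 
                  size = 50, 
                  replace = TRUE,
                  prob = c(0.25, 0.75))

#...and compare their statistics. Each will likely differ from 0.25 AND from the other.
prop(sample_A, success = "Hunger-related")
prop(sample_B, success = "Hunger-related")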

The beauty of simulating taking samples like this is that we can do something, inside the computer, that we couldn’t dream of doing in the real world (at least, not within the lifespan of the average graduate student!): Let’s take 1,000 random samples and just see how many different “answers” to our question we get! Using the tools in the mosaic package, this is easy, and I don’t say that lightly! It takes just a few lines of code and a few seconds:

#Here, we use do() to take 1,000 random samples from the population, collecting and storing the sample statistic each time.

rand_samples1000 = do(1000)*{
  our_sample = sample(c("Hunger-related", "Not-hunger-related"), 
                      size = 50, 
                      replace = TRUE,
                      prob = c(0.25, 0.75))
  prop(our_sample, success="Hunger-related")
}

#And here we can graph a histogram of our 1000 sample statistics!

histogram(~prop_Hunger.related, data = rand_samples1000)

A histogram of our 1,000 random sample statistics.

In the above graph (a very important type called a histogram), on the x-axis, we’ve divvied up all the sample statistics we got from our 1,000 samples into “bins,” such as from 0.1 up to ~0.133. Then, we counted up the number of statistics falling into each bin. The bar heights reflect those counts (although the y-axis has been rescaled here), so, for example, the bar between 0.25 and 0.28ish being the highest bar means more sample statistics fell into that bin than into any other.
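
If you’d like to check those bin counts yourself rather than eyeballing bar heights, something like the following should work. (Note: the bin width of roughly 0.033 is my guess at what the plot used; adjust the breaks to match your own histogram.)

#Count how many of the 1,000 sample statistics fall into each bin of width ~0.033.
table(cut(rand_samples1000$prop_Hunger.related, breaks = seq(0, 1, by = 1/30)))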

This makes sense, right? The “truth” is 25% (we know this because we picked it!), so samples producing an answer close to 25% should be common, and they were! This is key stats idea #3: While any one sample may be randomly close to or far from the “truth,” a (large-enough) set of (representative) samples will cluster predictably around the “truth!” This is actually the Central Limit Theorem (CLT) in action again, just on a larger scale.
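
We can even put numbers on that clustering. Here’s a quick sketch using mosaic’s formula-style mean() and sd(); the last line is the spread the CLT predicts for a true proportion of 0.25 and samples of size 50, and it should land quite close to the spread we actually observed:

#Center and spread of our 1,000 sample statistics...
mean(~prop_Hunger.related, data = rand_samples1000) #Should be close to the "truth," 0.25.
sd(~prop_Hunger.related, data = rand_samples1000) #The observed sample-to-sample spread.

#...versus the spread the CLT predicts: sqrt(p*(1-p)/n).
sqrt(0.25 * 0.75 / 50) #Roughly 0.061.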


However, you may have noticed that plenty of samples did NOT end up producing answers particularly close to the truth. Remember our 14% and 32% sample statistics from earlier? You can see in the graph above that these were by no means “flukes.” In fact, a whole 72 out of our 1,000 samples yielded statistics of exactly 32%. That’s not a lot, but it’s not none either!
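
(If you’re following along and want to count these yourself, a one-liner like this should do it; your count will differ from my 72 unless you used the same random seed.)

#How many of the 1,000 statistics equal exactly 0.32 (i.e., 16 hunger-related aches out of 50)?
sum(rand_samples1000$prop_Hunger.related == 0.32)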

This is key stats idea #4: In most cases, it is actually more likely that our statistic will NOT equal the parameter than it is that it WILL equal it, even when the sample was a “good” one. Of course, you can get a statistic that is exactly equal to the parameter, but as you can see, you shouldn’t count on it!
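
To make this concrete, we can ask R how likely an exact bull’s-eye even is. One wrinkle with our particular setup: with 50 subjects, no whole number of hunger-related tummy aches works out to exactly 25% (that would take 12.5 of them!). So, purely for illustration, suppose we had sampled 48 subjects instead; then 12 hunger-related aches would match the parameter exactly, and dbinom() tells us the chance of that happening:

#Probability that a (hypothetical) random sample of 48 yields a statistic of exactly 0.25,
#i.e., exactly 12 hunger-related tummy aches, given that the true proportion is 0.25.
dbinom(12, size = 48, prob = 0.25) #About 0.13, so we'd miss the bull's-eye roughly 87% of the time!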

So, let’s tie up some ideas here: We stats-loving folks like asking questions and then finding support for explanations that answer those questions. We do this by taking careful samples from our population of interest and seeing what statistic we get from those samples. Importantly, we view our statistic as an “informed estimate” of what the true parameter might be, but, at the same time, we assume that estimate is probably still wrong–probably by relatively little, but possibly by a lot!

It’s this realization that gives rise to one of the most important statistical tools around: the Confidence Interval. Since the logic–and math–behind Confidence Intervals (CIs) is where the logic (and math) behind finding “ideal sample sizes” comes from, let’s move on to talking about CIs next! That’s the subject of Article 4 in this series, which you can get to here.
