Why You Should Submit to a Data Repository

In this post, I’ll recap the biggest reasons data repositories are such important resources and why we should all be using them and submitting our data to them, whether we’re actually made to or not!

  1. Reason #1: Eventually, you WILL be made to…
  2. Reason #2: It Can Get Your Work Noticed, (Re)Used, and Expanded upon
  3. Reason #3: It Helps Others Understand Your Work
  4. Reason #4: It’s an Assist to Your Future Self
  5. Reason #5: It Helps Others Verify Your Work
  6. Bonus: A Resource You Contribute to Becomes More Valuable to Use Too

In my new role as the Quantitative Ecologist for MAISRC, I’m aiming to ensure all MAISRC graduate students, post-docs, and other researchers feel ready and excited to submit their data to the Data Repository for the University of Minnesota (DRUM) upon project completion, as this is one of our funding requirements.

What’s a data repository, you ask? It’s a digital “depot” for (varyingly public) data storage and retrieval that also provides a home for the metadata needed to contextualize the data housed there. If Google Scholar or a journal’s website is one online “hub” for scientific research, a data repository is a corresponding hub for the data that research is based on.

While not everyone may feel the same at first, I for one am really excited that we require our researchers to submit to a repository! Data repositories are unquestionably valuable, and their use should (and will inevitably) become more mainstream. Even I need to use them more than I have in the past, so I appreciate being “forced” to do so!

Reason #1: Eventually, you WILL be made to…

Ok, this is maybe not the most “inspiring” reason to submit to a data repository, but it’s one to be aware of, and I think it’s a good place to start! As I mentioned above, sometimes, we all just need to be forced to do things that are good for us, and being required to use repositories qualifies, in my view!

MAISRC is certainly not alone in requiring its researchers to submit their data to a repository! Many scientific journals, including many high-impact ones we may aspire to publish in, like Nature and PLOS ONE, now require authors to deposit the data underlying their publications in a repository so other researchers can check out not just the study but the data collected too.

Many funding agencies, including federal ones like the Department of Transportation and the US Agency for International Development, also have this requirement or may soon, thanks to recent White House guidance. Most that don’t yet require data submission at least require grantees to craft Data Management Plans (DMPs) that state how grantees will ensure their data (gathered with public funds!) are available to that same public in some way long into the future. Because submitting your data to a repository helps meet the requirements of a DMP (and then some!), repositories are often actually the path of least resistance.

Incidentally, I couldn’t find any evidence to back this hunch up, but I wouldn’t be surprised if graduate programs at public, research-focused universities like the University of Minnesota eventually move towards mandating submission of thesis/dissertation data to a repository as a graduation requirement for many of the same reasons cited by journals and funding agencies. And many private enterprises also expect their employees to archive data for internal inquiry purposes.

Bottom line: You’re just not likely to “escape” the clutches of data repositories for much longer anyway (for good reason!), so getting comfortable with them now will put you ahead of the curve! Speaking of those good reasons…

Reason #2: It Can Get Your Work Noticed, (Re)Used, and Expanded upon

Let’s face it: We all yearn for our data to be “discovered” by others for all the right reasons! Publishing your data in a repository opens up so many new routes for potential discovery:

  • Your data could get used in a meta-analysis or comparison study. Your data are easier to incorporate into one of these studies if they’re published and easily accessible, something that data repositories facilitate! As just one example from my own career, we made the data set from Ruan & Robertson 2017 a focal point of this paper in no small part because it was similar to our own data set and it was publicly available, so we didn’t have to ask for it. Not only does getting included in a “re-analysis study” often get your original work cited again, it also usually means that your study gets recapped and recontextualized as well, bringing your study and its insights back into the conversation, perhaps for an entirely new audience.
  • Your data could get linked into a larger data pool or used in new ways. When data, collected by different people for different reasons, come together, good things happen! Simply put, getting your data published helps facilitate others using them for new purposes or linking them to other data elsewhere to yield new insights. This is kind of the goal behind resources like the Minnesota Natural Resource Atlas. Often, analysis of a combination of individual data sets yields new discoveries, opens up new lines of inquiry, and reveals key knowledge gaps. Who knows what new questions or answers your data may spark once they’re out in the ecosystem and are allowed to interact with other data sets and other minds!
  • Your data could get used as an example. Textbook authors, stats bloggers, college professors–many groups of people are looking for interesting data to use in examples and homework/exam questions! Making your data readily accessible makes it easy for these folks to use them. Often, your research gets a plug at the same time so that people can understand the context behind the data they’re being asked to consider. For example, my colleague, Dr. John Fieberg, will often highlight one or two data sets in each lecture of his Biometry course, spending several minutes detailing the study’s question, methods, and findings. Such good press can be hard to find!
  • Your published data sets can get you citations. Most data repositories offer a way for others to cite not just your original publication but your data as well. For example, DRUM ensures that data sets housed there get their own DOI numbers. This means that these data sets can be found when searching on platforms such as Google Scholar that return anything with a DOI (see pic below). These citations can advance your career, bolster your credentials, or help you sell your work to funding agencies. As the folks at PLOS ONE correctly point out, your data are not afterthoughts–they are the entire point of a research project. They deserve an opportunity to receive attention on their own merits!
All the entries that begin with “Data in support of:” refer to data sets Kelsey Vitense has published through DRUM.

Bottom line: Putting your data out there gives you an entirely new way for your work to be (re)discovered and to produce new and valuable insights!

Reason #3: It Helps Others Understand Your Work

By my own admission, even I–the “quant guy”–can’t always follow the Methods/Results sections of papers. This isn’t really that surprising; explaining one’s entire analytical workflow clearly and succinctly within the narrow confines of a Methods section is really hard!

But we can all agree it’s better for us when others can better understand our work. Looking at a study’s data set can often make a corresponding analysis less opaque (at least for me). “Oh, that’s what that variable is and what it looks like. Oh, that’s what the sample size was for that test (because this is the number of rows in the data set). Oh, that’s how many sub-groups there were.” A lot of these basic questions become much more answerable when you can pop open an Excel file and see the details for yourself!

Plus, offering up your data set often means not needing to describe it in nearly as excruciating detail as you might otherwise need to–if people really want to know more, they have your data set to check out!

Reason #4: It’s an Assist to Your Future Self

Often, our future selves and future colleagues will need access to our past data. Finding these data and making sense of them many years later (after we’ve switched jobs, states, and laptops 6 times! Or is that just me…) is often not easy even for the otherwise very organized people who collected the data!

Data repositories help us out there too. First, they ensure that the data are maintained in a stable location for the long term–no more hunting on our computers for that “lost data file!”

Second, they ensure that the data are in a file format that we and others will always be able to open (e.g., .csv instead of .xlsx), and they ensure that the data are stored in a computer-readable way following standardized data science conventions that eliminate a lot of the biggest sources of “data set confusion.” For example, no more opening data files only to be confronted by a tangle of comments, highlights, and colors, all with their own long-forgotten meanings!

How not to store data for long-term understandability!

Third, they ensure that data are accompanied by the metadata needed to make proper and robust sense of them. How were they gathered? What are the units on this column? What do these abbreviations mean? Which data points are outliers? Based on what criteria? This info (and often much more) will be in the files accompanying the data submitted to a repository. What better way to refresh your future self’s memory on the data than a comprehensive “crash course” written by your past self??

One other point worth mentioning: If you’re really lucky, someone will want to use your data from your study in some way (see Reason #2!). That’s the best feeling! Sometimes, though, just having to try to reach out to you via email to ask for your data feels like too much, so they don’t do it. Other times, they can’t find your most recent contact info, so they end up sending the request to your long-abandoned undergrad email address, where you’ll never see it. And, sometimes, they do get a hold of you, only for you to then have that panic-inducing moment: “Oh no…where did I save that data set? DID I save that data set?! Was it on my old laptop that I lost in that volcano??” Even if you do manage to find the data file, that doesn’t mean you’ll understand what’s in it any longer or feel confident in your ability to explain its contents to anyone else!

Data repositories help to prevent all these issues! Many data repositories, such as DRUM, offer quality control/quality assurance checks when your data are deposited: they check to make sure files open, code runs, and files are clear and organized and haven’t gone missing. They back up the data so that freak accidents can’t result in permanent data loss. They also do checks like making sure missing data are appropriately coded, data were transformed properly, and units are recorded somewhere. They also make sure files and columns are named clearly and consistently so that they can’t get mixed up (by you or anyone else!).
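To make those checks a little more concrete, here is a minimal sketch in Python of the kind of pre-submission check you could run on your own files before depositing them. To be clear, this is not DRUM’s actual process: the function name `precheck`, the “NA” convention for missing data, the list of suspect codes, and the sample column names are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical conventions for this sketch; a real repository may use others.
MISSING_CODE = "NA"
SUSPECT_CODES = {"", "-999", "n/a", "?", "missing"}


def precheck(csv_text, units):
    """Return a list of human-readable problems found in a CSV data set.

    csv_text: the contents of a CSV file that has a header row.
    units: dict mapping column name -> unit string (a tiny data dictionary).
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Column names should be unique, free of spaces, and have a recorded unit.
    if len(set(columns)) != len(columns):
        problems.append("duplicate column names")
    for name in columns:
        if " " in name:
            problems.append(f"column name contains spaces: {name!r}")
        if name not in units:
            problems.append(f"no unit recorded for column: {name!r}")

    # Missing values should use one agreed-upon code, not blanks or -999.
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for name in columns:
            value = row.get(name) or ""
            if value.strip().lower() in SUSPECT_CODES:
                problems.append(
                    f"row {row_number}, {name!r}: found {value!r}, "
                    f"expected {MISSING_CODE!r} for missing data"
                )
    return problems


# A made-up example: one poorly named column and one ad hoc missing-data code.
sample = "lake_id,depth m,temp_c\nA1,3.2,18.5\nA2,-999,NA\n"
issues = precheck(sample, units={"lake_id": "none", "temp_c": "degrees Celsius"})
for issue in issues:
    print(issue)
```

Running something like this before you submit won’t replace a curator’s review, but it can catch the easy stuff early, and the `units` dictionary doubles as a first draft of your metadata.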

Bottom line: Others don’t need to come to you for your data (they can go straight to the repository), you don’t need to find the data and clean them (they’re already clean), you don’t need to explain the data (that’s what the metadata are for!), and you don’t have to worry about having lost your data files or them no longer opening. Data repositories are worth it for the peace of mind alone!

Reason #5: It Helps Others Verify Your Work

If you’ve made it this far, I figure I have you convinced enough that you’re ready to hear this reason. I saved this one for last because, well, if you’re anything like me, it’s going to sound maybe more like a threat than a benefit! But hear me out.

For many of us, our worst fear is finding out we goofed–the conclusions we reached from our study weren’t actually supported by our data, we inverted the values in a table, we did the “wrong” statistical test, etc…all the stuff of nightmares for many scientists!

Doesn’t putting our data “out there” increase the odds someone will discover them and use them to realize that we’ve made a “mistake?” Well, yes. Isn’t that scary? Well, maybe, but it shouldn’t be.

We all need to get over our fear of “analytical scrutiny” for two key reasons. First, there are few, if any, “statistical hitpeople” out there, going around downloading data sets and verifying that “all your stats were virtuous.” No one has the time or incentive, frankly!

Well, ok, there’s this person I found on Google Image Search, but turns out that they just really like video games! Whew.

Second, being afraid to share our data for fear that someone may use them to discover our errors is, at least in my opinion, fearing the wrong outcome. In my experience, this fear mostly stems from a concern about one’s professional well-being. That is, we worriedly ask “What will happen to my reputation if someone finds out I did my analyses wrong that one time??” [Side-note: Once an analysis gets to a certain level of complexity, the notion of “doing one’s stats wrong” becomes really murky at best…but that’s a topic for a different post!]

Fewer people, myself included, take the time to ask “What if the world (including me!) doesn’t find out I did my analyses wrong until much later? What if no one ever figures it out?” What dawned on me, when I stopped to really think about it, is that even I don’t really benefit when people don’t catch my mistakes.

First, I might continue to conduct my analyses wrong if I’m never corrected! Talk about doubling down on a mistake! Remember that part of the reason we have a system for peer review in science is that it gives us the opportunity to constructively review our peers–if I could prevent a peer from committing an unfortunate analytical error, I’d want that chance!

Second, your reputation (if that’s really all you’re worried about) is probably not going to suffer less if your errors aren’t found out until later in your career–if anything, it may suffer more. After all, shouldn’t you “know better” by the time you’re an established expert? If your mistakes aren’t corrected until after your career is over, that could even cloud your entire legacy!

Third, from an institutional standpoint, uncaught mistakes “pollute the scientific record.” There are mis-analyzed data out there right now, lurking in plain sight. You and others could be staking your careers and billions of taxpayer dollars on lines of inquiry that only appear to be warranted thanks to one errant analysis. You don’t benefit in the long run by pulling the scientific enterprise in the wrong direction, especially if it was an honest mistake that was correctable!

So, putting your data into a repository serves a critical purpose in the “scientific process”: it allows us, the scientific community, to verify that your study’s conclusions make sense in light of your data. That might make you nervous, but it should make you more nervous not to put your data out there and open them up for such verification! In my opinion, at least, issuing a correction (or [gasp!] a retraction!) is far less serious an outcome than letting errant conclusions spread within the scientific literature.

Besides, ask yourself: Would you be inclined to trust a scientist who adamantly refused to share their data? I wouldn’t.

On the plus side, in many cases, the very process of submitting to a data repository will catch some of the most common (and silliest!) mistakes, such as mistransformed data, typos, mixed-up units, and incorrectly handled missing data. Better for these mistakes to be caught before they see a wider audience, right?

By the way…while I’d never go around scrutinizing someone’s analyses unprompted, if I ever did find evidence that an analysis went wayward upon seeing a published data set, I’d genuinely react in some combination of these three ways:

  • “Well…I’m not sure I’m right about this so I’ll just work with these data in my way and assume their way was also valid…”
  • “Oops!…oh well! I’m sure they did their best.”
  • “Hmm…maybe I’ll find a way to quietly let the authors know…just in case…”

I’m definitely not going to be calling in the “stats police” (can you imagine if this were a thing??), and no one else I know and respect would either!

Bonus: A Resource You Contribute to Becomes More Valuable to Use Too

Really quick, here’s one more reason to submit to a data repository: They exist for YOU to pull data from too! When a collective resource grows larger, it gets more valuable for everyone to use it. By “investing” your data into a repository, you give more people more reasons to come use that resource too, both as contributors and as data consumers, which makes the resource reciprocally more valuable for you–for example, maybe you will have inspired others in your field to put their data there, and you can springboard off of those data for your next project!