How to Store Data Well: The 10 Commandments of Successful Data Storage

Use the Table of Contents below to easily navigate this article!

  1. Commandment #1: Thou Shalt Beget a Folder Structure Around Thy Data Files
  2. Commandment #2: Thou Shall Not Worship Any Other Gods Before .CSV
  3. Commandment #3: Remember a Naming Convention, and Keep It Holy
  4. Commandment #4: Thou Shalt Keepeth Raw Data Files Raw
  5. Commandment #5: Accompanyith Thy Data with a Metadata File
  6. Commandment #6: Honor Thy Rectangle
  7. Commandment #7: Thou Shalt Covet No Other Forms Over Long Form
  8. Commandment #8: Thou Shall (Probably) Not Leave Cells Empty
  9. Commandment #9: Thou Shall Not Abet “Visual” (Meta)Data
  10. Commandment #10: Thou Shall Not Permit Two Distinct Data Types to Lie Together
  • Need convincing? Look for the Why Do It headings!
  • Just want to be told what to do, like a good little sheep (bah bah!)? Look for the What To Do headings!
  • In a big hurry? Skim the Plain English explanations!

Some will say I’m being dramatic framing my list of “Top Ten Rules of Good Data Storage” as “Commandments.” Ok, yeah, guilty as charged. But the gravity I place upon this issue is earnest: Storing one’s data successfully is incredibly important. After all–and this is one of my most fervent mantras–data are not byproducts of research; they are the entire point of it! So, taking good care of your data is inseparable from doing good research!

Besides, by my count, there are exactly 10 such rules, so how could I resist??

Anywho, my motivation for writing this article is simple: There ARE data storage best practices; they ARE widely appreciated and agreed upon within Data Science circles; and they ARE valuable and important (and increasingly expected!). However, finding them neatly explained all in one place? That’s harder…

So, here’s my attempt to put them all in one place for you! If you want to know more, though, I recommend taking a workshop from the fine folks at Data Carpentry; their lessons are available for free on their website, if you want to work through them at your own pace!

Let’s get to it! If only I had a rockin’ beard to complete the look…

Commandment #1: Thou Shalt Beget a Folder Structure Around Thy Data Files

Plain English: You should be as focused on how your data files are stored on your computer as you are focused on how your data are stored inside your data files. Build a system or structure around your data files that organizes and protects them.

What to do:

  1. At the start of a data project (e.g., the second chapter of your thesis), create a folder on your computer (or, better yet, on a cloud-backed-up platform with version control like Dropbox or Google Drive!) to contain ALL files related to and generated by that project. [See Commandment #3 for why picking a good name for this folder is key!]
  2. Within your project folder, create subfolders for every distinct file type your project will involve or generate. For example, you may need a “Documents” folder for Word files (e.g., your manuscript), a “Data” folder for your data sets, a “Pictures” folder for your figures, a “Code” folder for your R scripts, and so forth. An “Old Files” subfolder inside of each of these, while optional, can be useful for keeping subfolders uncluttered. Remember, no one is charging you for extra folders, so use them liberally!

Why do it:

  • One hallmark of good research is “shareability.” If a colleague wants to collaborate, how easily could you share with them everything they need to get “on board?” What could be easier than sharing a single, organized folder, both for you and for them?
  • Many of us have a habit of creating many versions of (more or less) the same files as we move through a project. You know what I mean…”Version 1,” “Version1_2,” “VersionJune19,” “VersionFinal,” “VersionFinalReallyIMeanIt,” etc. This isn’t a very good practice (using version control is the correct alternative! But that deserves its own post…). However, it works better if all versions are at least in one place so they can be found and differentiated quickly–using an “Old Files” subfolder to “hide” older versions makes it work even better.
  • Sorting files by type first (and by purpose/phase second using another layer of subfolders, if desired!) makes it easier to find the files you’re looking for when you’re looking for them!
Everyone’s ideal project folder structure will be different, and it may even vary by project–the key is to have one and really stick to it!
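
In fact, if you use R, you can script your folder scaffolding so every new project starts out organized the same way. Here’s a minimal sketch (the folder names are just placeholders–swap in your own convention!):

    # Create a standard project skeleton (names here are placeholders)
    project <- "ZebraMussel_ThesisCh2_2022"
    subfolders <- c("Data", "Documents", "Code", "Pictures")
    dir.create(project)
    for (sub in subfolders) {
      # Each file type gets its own subfolder, plus an "Old_Files" layer inside
      dir.create(file.path(project, sub))
      dir.create(file.path(project, sub, "Old_Files"))
    }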

Commandment #2: Thou Shall Not Worship Any Other Gods Before .CSV

Plain English: So-called proprietary file formats (i.e., those only readable/writable by certain (typically not-free) software, such as Microsoft Excel) should be avoided! “Barebones,” text-based formats such as .csv are best in 99% of instances.

What to do:

  1. Plan to save all your data files as plain-text files, specifically as .txt (as tab-delimited files) or .csv (as comma-separated values) files. Even in proprietary software like Excel, these file types are still options on the “Save as” page!
  2. If proprietary formats are absolutely necessary for a particular project (e.g., when working with 3D models or genetic sequence data), at least make copies of whatever project data don’t need the proprietary format and store those copies in a non-proprietary format alongside the proprietary data set, so that as much of the data as possible lives in plain text.
  3. Additionally, avoid using “premium features” of proprietary software (for storage, not analysis!) whenever possible. Sometimes, you may be able to open proprietary files using free software, but free software will often lack access to anything related to premium or proprietary features. For example, comments may be visible in proprietary software but not in free software.
A .CSV file in the flesh. Each row of the original table is a line (made using the enter key aka a “hard return”), and the values in the original columns are separated within a row via commas, hence “comma-separated values.” While maybe not the easiest format to read for a human, if you had to, you could re-read this file by eye to create the original table in a pinch!
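
If R is your tool of choice, for example, writing and reading these plain-text formats is a one-liner each way (file names here are hypothetical):

    my_data <- data.frame(SITE = c("A", "B"), ZM_DENS = c(14.2, 9.8))  # toy data
    # Save as .csv; row.names = FALSE keeps R from adding a junk index column
    write.csv(my_data, "ZM_FieldData_2022.csv", row.names = FALSE)
    # Read it right back in--any CSV-aware program on any computer could do the same
    my_data <- read.csv("ZM_FieldData_2022.csv")
    # A tab-delimited .txt file works just as well:
    write.table(my_data, "ZM_FieldData_2022.txt", sep = "\t", row.names = FALSE)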

Why do it:

  • Simply put: .TXT and .CSV files are so basic as to be almost “un-screw-up-able.” Heck, if YOU opened up a .CSV file, YOU’D be able to read it, if you had to. Computers won’t struggle to read these files, no matter their age, software, or capabilities, and these files are difficult to corrupt, too.
  • Good research is sharable! Why store your data in a format that can’t be easily shared and viewed by potential collaborators?? .TXT and .CSV formats are great because any computer anywhere can open them using free applications, and this will (so far as we can peer into the future) always be true.
  • Being tethered to proprietary software is a pain–and a risk–for YOU! Are you always going to want to use Excel? Do you know Microsoft will always make Excel? Do you know your future computers will always understand how to read Excel files? And do you know that all the same features will always exist in every future version of Excel? Do you know you will always be able to afford Excel (it may be free for you now because you’re at a University, but that won’t always be the case!)? Backwards compatibility and continuity are things Microsoft does consider, but still, best to not risk it!

Commandment #3: Remember a Naming Convention, and Keep It Holy

Plain English: While it may sound like a trade show for expecting parents, a naming convention is actually a set of rules you ALWAYS follow when naming ANYTHING related to your data. Yup, that’s a thing! Keep names 1) Short, 2) Computer-readable, 3) Unambiguous, 4) Descriptive, and 5) Searchable. So…I guess…SCUDS. Yep, let’s go with SCUDS.

Not these SCUDS though…

First, let’s come to terms with SCUDS!

(S) Short: Super long names (especially for columns and variables) are confusing and a hassle to remember and type correctly when writing code. Aim for efficient names.

(C) Computer-readable: Computers tend to hate some things humans tend to like. The three biggest are: A) Spaces, B) Leading numbers, and C) Special characters. Thus, “My Awesome Data,” “2022 Field Data,” and “My @we$some Data!” are bad file names. Stick to only capital and lower-case letters, underscores ( _ ) in place of spaces, and only trailing numbers (“Field_Data_2022” is better).

Also, note that many computer languages (e.g., R) are case-sensitive, meaning R is a different character than r, and “Rock” is different from “rock.” Don’t be casual about capital letters in your naming convention! Use them consistently or don’t use them.
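
A quick R illustration of that last point:

    # R treats differently-capitalized names as entirely different objects
    Rock <- 1
    rock <- 2
    Rock == rock  # FALSE--these are two separate variables!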

(U) Unambiguous: Forgettable, generic, “easy” names tend to be bad! “Data1.csv” is a bad file name–if it’s what the lazy part of your brain reached for first, it’s what it will reach for the next time too! Next thing you know, you’ll have 30 “Data1.csv” files on your computer, and good luck keeping those straight!

(D) Descriptive: The ultimate goal of our naming convention is to ensure we (and our collaborators!) always know what we are looking at. Descriptive names are paramount to that! “Field Data” is a bad file name. Which field season? What year? “MEAN” is a bad column name. Mean of what? What were the units? How is this mean different from the 100s of other means you’ve quantified? Of course, there’s a fine balance between Descriptive and Short–hence, think efficient.

(S) Searchable: In your naming convention, try to incorporate elements that make files more searchable–including years or dates is helpful, as is the use of keywords or “tags” (e.g., “ZM” for Zebra Mussel-related files). Ponder–what would future you probably type into a search box to try to find this file (or column) again? Anticipate that future search and put that info in the name!

What to do:

  1. Open up a text file, save it in a prominent place, call it “My Naming Convention” or whatever, and Write. Your. Naming. Convention. Down.
    1. How will you construct project folder names? Maybe a “Goal_Lab_Year” format works for you.
    2. What subfolders will you create inside of each project folder? What will these be called? Maybe they need “tags” on them, like “Data_PZ_proj,” rather than just “Data” to be more searchable.
    3. How will you name files? What “pieces” go in what order? If you will allow capital letters, underscores, or numbers, how and when will they be used?
    4. How will you name columns in your data files?
      1. For these, I personally prefer to use all caps (so I don’t have to worry about case-sensitivity), and I aim for something under 20 characters but that packs in as much detail as possible. For example, MEAN_ZM_LKDENS_M might be good. It contains the tag “ZM” to let me know this is “Zebra Mussel” data, it’s tagged with “MEAN” to let me know it’s not raw data but rather derived data (a mean), it contains the tag “LKDENS” for lake-level density, and ends in _M to remind me the denominator units of density are meters rather than kilometers. All for just 16 characters!
        As you can see, good naming is more art than science–do what works for you, but if you have a set of “codes” or “tags” to use (for things like WHAT, WHO, WHERE, WHEN, UNITS, etc.) and an order this stuff is supposed to appear in, coming up with good names can be pretty straightforward!
  2. Once you have your naming convention, Actually. Use. It. Oh, trust me, I know–we all get lazy sometimes, especially in the thick of a project. Don’t let laziness get to you when your data are at stake!

Why Do It:

  • You absolutely do NOT want to accidentally delete, lose, or overwrite needed data! Generic, forgettable names make those events more likely. They also make searching for particular data or easily understanding what you’re looking at harder. Bad names may be easy to create, but they cost you tons of time and stress down the road.
  • The opposite can also be true! Think about it this way: If you craft a good naming convention when you’re not feeling lazy, all your future lazy self has to do is apply it! A good naming convention makes it easier to be lazy, not harder, as unintuitive as that may seem at first!
  • Duplicate names are both confusing and dangerous–a good naming convention (and an organized file structure) helps to prevent them.
  • The best data in the world are worthless if we (and others) can’t find them when we want to!

Commandment #4: Thou Shalt Keepeth Raw Data Files Raw

Plain English: If I asked you to preserve a document, would you go and scribble on it? I hope not! We should view our goal as preserving our data. That task includes keeping a copy of our raw data that we basically never touch.

What to do:

  1. When you finish entering your raw data for the very first time into a spreadsheet, save this file (perhaps using something like a _RAW tag). After the day this RAW file is made, plan (and hope) to never have to open it again!
  2. Next, immediately create a copy of your raw data file and move it to a second file system somewhere else (a flash drive, Google Drive, your mom’s roommate’s computer, whatever!)–in other words, stash at least one backup of your raw data file somewhere. Your raw data file is your Ring of Power: Keep it Secret, Keep it Safe.
What you should always be asking about your raw data file!!
  3. Then, immediately create yet another copy of your raw data file and change the tag to something like _USE. Whenever you want to do ANYTHING with your data, it’s this version you open to do that, not your _RAW file! Any tinkering, cleaning, wrangling, etc. happens exclusively in your _USE file so that your _RAW file stays raw.
  4. When you DO go to tinker with your data file, here’s how to do it well:
    Open your _USE file AND a text document at the same time. In the text file, write a single sentence (or two) for each change you make to your data file, documenting what you did to your base _USE file to get it to your desired endpoint. For example: “I changed the MEAN_ZM_LKDENS_M variable to MEAN_ZM_LKDENS_KM by dividing by 1000 because I like those units better because reasons.” Think of this text document as a “recipe book” for how exactly you changed your data. Then, save the end data sheet (maybe with the _CLEANED tag) and the text file (maybe with the CLEANING4_ tag) in the same folder.

Why do it:

  • Heavens forbid you ever lose your data file, or it gets corrupted, or you can’t access it, or your hardcopy datasheets burn in a mysterious napalm accident, or…if you’ve got your _RAW data file, you’re fine! It’s somewhere safe and predictable, it’s clearly marked, it’s backed up, and you know exactly how you got from there to your more recent data sheets. This could be four-plus years of your life–having a backed-up raw data file is a very cheap insurance policy for the benefits it potentially offers!
  • If you’re working on collaborative teams, there’s less room for confusion about what files to edit and which to leave alone if you have a raw data file that is distinct from cleaned or “to use” files.
  • Having clear, repeatable, and sharable “recipes” for how your final data sets were generated makes your research more transparent and trustworthy–this is why data repositories often ask for “recipe” files when you submit your data. These also help you remember what you did, so you can redo your work if you have to, retrace your steps to find errors, and defend your work to others.

In general, we want to try to do as little “data wrangling” in a spreadsheet program as humanly possible anyway! For one thing, data wrangling in Excel…well…it just sucks. It sucks a lot. Sorry not sorry, Excel! It’s slow and it’s unintuitive and it’s error-prone, etc. etc. Second, doing so is only reproducible (aka re-do-able, by our future selves or our collaborators or our reviewers) to the extent we take really good notes in our “recipe” files. Not only is it easier to manipulate data in R than in Excel, it’s faster and way more reproducible. If you have written a good R script for the purpose, you could use it hundreds of times to do complex manipulations in seconds, all while never affecting the integrity of the original data file. And someone could read that script file to know exactly what you did. Hopefully, it’s clear how much better an approach that is!
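
Here’s a minimal sketch of what that looks like in practice–an R script that doubles as the “recipe,” reusing the hypothetical file and column names from above:

    # CLEANING4_ZM_FieldData_2022.R
    # Recipe: how ZM_FieldData_2022_USE.csv becomes ZM_FieldData_2022_CLEANED.csv
    zm <- read.csv("ZM_FieldData_2022_USE.csv")
    # Change MEAN_ZM_LKDENS_M to MEAN_ZM_LKDENS_KM by dividing by 1000,
    # because I like those units better because reasons.
    zm$MEAN_ZM_LKDENS_KM <- zm$MEAN_ZM_LKDENS_M / 1000
    zm$MEAN_ZM_LKDENS_M <- NULL  # drop the old column (it still lives on in _RAW!)
    write.csv(zm, "ZM_FieldData_2022_CLEANED.csv", row.names = FALSE)
    # Note: the _RAW file is never even opened, let alone modified.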

Commandment #5: Accompanyith Thy Data with a Metadata File

Plain English: While legacy formats like .CSV are great in most ways, they are not the best format for storing all-important metadata, i.e., all the data that exist about your data. What do the column names mean? What are the units for that column? What does this abbreviation stand for? How (generally speaking) were the data collected? Are any data points “fishy” in some way? All this information is vital for understanding a data set, especially for anyone who isn’t the person who collected it–so it should formally accompany the data set!

What to do:

  1. Whenever you create a new data file, open a text file (perhaps tagging it with a METADATA4_ tag) with a related name. Wherever your data file goes, this text file goes too. In this text file:
    1. Provide a brief summary of the “methods” associated with the collection of your data–if there is a more detailed summary available (perhaps in a publication), you can provide a link/reference/file name rather than rehashing the minor details in your metadata file.
    2. Provide a list of 1-2 sentences for each column. What kind of data is this column holding? What’s the column name mean (if it’s not obvious)? If there are any abbreviations (e.g., the data are species abbreviations), provide a glossary. If there are units, explain. etc. etc.
    3. Provide a 1-2 sentence explanation about what each individual row represents. Is a row a specific lake and everything we know about it? Is it a specific spot in a specific lake? Is it a specific spot in a specific lake on a specific date? In stats terms, rows are (usually) observations (or “cases”); what are the “units of replication” in your data set? [More on this later!].
    4. If you have any data quality concerns (e.g., four of your measurements were taken on a day the probe was acting up) [and these aren’t already marked in a NOTES/ISSUES column], provide a list of these concerns in your metadata file. Go through your field/lab notes and copy any relevant details about potential issues or curiosities into your metadata file so they are backed up and associated with the data. This isn’t “admitting weakness;” this is good science.
    5. If you have any future plans or “checklist items” for the data set you want to remember (e.g., “I want to look for outliers in this column”), this document could be a place for them too.
    6. If you have coded any of your data in any way, meaning you have converted words –> abbreviations or words –> numbers (Green = 1, Blue = 2, Red = 3, etc.), Explain. Your. Codes. In. Your. Metadata. File! We all think we’re going to remember our codes forever. We’re all going to forget our codes before forever! Plus, our collaborators won’t understand them–they need and deserve a guidebook!
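
To make this concrete, here’s a sketch of how the top of such a metadata file might read (every name and detail below is invented for illustration):

    METADATA4_ZM_FieldData_2022.txt

    METHODS: Zebra mussel densities surveyed at 12 lakes, June-August 2022;
      full methods are in the draft manuscript in the Documents folder.
    COLUMNS:
      LAKE_ID          - Unique lake identifier (text).
      MEAN_ZM_LKDENS_M - Mean zebra mussel density per lake, per meter (derived).
      TRT              - Treatment code: 1 = control, 2 = low dose, 3 = high dose.
    ROWS: Each row is one lake on one sampling date.
    KNOWN ISSUES: The probe acted up on four measurements; see the NOTES column.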

Why do it:

  1. Benefit #1 of Good Data Storage: Fewer headaches for future you. We all think we’re going to remember that controls were coded as 1s, treatment 1 individuals were coded 2s, and treatment 2 individuals were coded 3s. But, 8 months and two projects later…”wait, were treatment 2 individuals 2s or 3s?!” See how confusing this can get?? No data storage system is fully self-evident. By leaving yourself a reference document, you give your future self the gift of not having to remember a darn thing about your data! It’s all there for when you need a refresher (or anyone else does!).
  2. Benefit #2 of Good Data Storage: You’re the best collaborator ever. When you share your data folder, your collaborators not only get your data but a crash course on them too. No more 2-hour Zoom meetings of you just trying to explain what all the column names mean! Your collaborators instantly know (roughly) as much about the data as you do, and they can hit the ground running without using up needless time of theirs or yours.
  3. Benefit #3 of Good Data Storage: Streamlined error catching. If you know what your data are supposed to look like, errors stand out more. Suppose you had put “GA” in a cell in your SPECIES column. Hmm, “GA” isn’t a code you explained in your metadata file–might be a typo! If you don’t catch the error this way, someone else looking at your metadata later might, and that’s a good thing.

There are many viable formats for a metadata file! See here for some ideas.

Commandment #6: Honor Thy Rectangle

Plain English: The next three Commandments are closely related. This one says that the best “shape” for a data set is a single rectangle–No multiple sheets, and no “mini rectangles” on the same sheet. Either find a way to combine related data that have been spread over multiple rectangles or sheets into one rectangle or else create multiple data files (which is often unnecessary).

What to do:

  1. Don’t separate related data into multiple “mini-rectangles!” Typically, we do this to separate data by some kind of interesting grouping variable, such as year, treatment, species, field site, etc.–Instead, just make a column for that grouping variable and put all the data in a single rectangle.
  2. Spreading related data across multiple sheets is even worse! If a column (or two, or three) can’t sufficiently differentiate your data within a single rectangle, or if the data were gathered at multiple, incompatible scales (e.g., some were gathered on individual lakes and some were gathered on individual counties), these belong in two separate data files, even if they are from the same project.
  3. We often like to make cute little “summary tables” in our data files–such as the means and standard deviations for our treatment groups. However, summary tables do not belong inside our data files!! Move these to a separate “SUMMARY4_” file or to a Word or PowerPoint file. [Or, better yet, create them in R where you can save the code that creates them (and build them into attractive R Markdown Reports)!]
  4. Especially while tinkering with your data, resist the temptation to make multiple copies of (more or less) the same data across multiple sheets or data files.
No, no, a thousand times NO!
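
If you’ve already ended up with one (hypothetical) file per year, for instance, a few lines of R will stack them into a single rectangle with a proper grouping column:

    library(dplyr)
    d2021 <- read.csv("ZM_Density_2021.csv")  # hypothetical per-year files
    d2022 <- read.csv("ZM_Density_2022.csv")
    # Make the grouping variable an explicit column, then stack into one rectangle
    all_years <- bind_rows(
      mutate(d2021, YEAR = 2021),
      mutate(d2022, YEAR = 2022)
    )
    write.csv(all_years, "ZM_Density_AllYears.csv", row.names = FALSE)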

Why do it:

  • Simply put: Non-rectangles are not computer readable. If you move your data into R or SPSS (or really any analytical software), “mini-rectangles,” or related data spread out over multiple sheets, will be near-impossible for these programs to usefully interpret or associate.
  • Summarizing, sorting, searching, and filtering data–operations we do all the time to make sense of our data!–are all harder when related data are needlessly spread out.
  • When related data get spread out, it often becomes harder–not easier–to discover meaningful patterns or differences in them because they are too far apart to be visualized, filtered, sorted, and summarized at the same time.
  • It’s much easier for important data (and metadata!) to get lost or overwritten when some of them have “strayed” into a separate mini-table or sheet.
  • Spread-out data are harder to share and harder for colleagues and collaborators to make sense of, let alone for you to make sense of years down the road!
  • With respect to point #4 above, specifically, it’s almost impossible for any naming convention to unambiguously differentiate between many subtly different versions of the same data. Instead, make your changes in a single data file, maintaining both old and new columns if needed (no one is charging you for these!), and track all your adjustments in your metadata file. Trust me–you do not want the pain of having to decide, six months in, which of these 8 nearly (??!) identical columns of the same data is the one you’re supposed to be using!
  • The task of trying to maintain a single, useful metadata file is made much harder by spreading your data out all over the place.

…Need I go on?? No, I think not. ALL HAIL THE MIGHTY RECTANGLE!

What Google Image Search believes a “holy rectangle” to be. Yeah, ok, but not what I was going for…

Commandment #7: Thou Shalt Covet No Other Forms Over Long Form

Plain English: This one is a little tricky to explain, so buckle up! Data stored in a rectangle can be arranged in two main ways: 1) So-called “unstacked” or “wide” form and 2) “stacked” or “long” form. For most applications, the appropriate arrangement is long form, in which each row is a “unit of replication/interest” (whatever that means for your data set), and each cell in a given column represents a single datum observed for each unit.

What to do:

  1. First, ask yourself: What are my “units of replication?” If you’re not sure: It’s related to the question or purpose that spurred your data collection. For example, if I am trying to figure out if a pesticide kills a pest, my units of replication are probably individual organisms, and columns are things I know about each one (dead/not dead, weight when alive, life stage, etc.). However, they could also be tanks of such organisms and things I know about each entire tank (% dead, average weight, life stage, tank number, etc.) if I’m interested in the effects of the pesticide on populations rather than on individuals.
    Maybe the best way to think about this question is this: If my graph, at the end of my project, were a scatterplot, what would each point in that plot be? If it were a bar graph instead, and my y-axis data were “counts,” counts of what? The answers to these questions should point you to what your units of replication are…I hope!
  2. Next, (re)arrange your data so that each unit of replication has one (and only one) row. Each column, then, is one thing you know about each unit.
    1. If you find yourself frequently having relevant info for some units but not others, i.e., you need a lot of blank cells, this could be a sign you haven’t found long form yet. What units do you know (roughly) the same amount of information about?
    2. If your study involved “repeated measures” (e.g., you took measurements at the same lake six times over the course of a year) and this is something you’re going to look at explicitly with your analyses, then long-form in your case may actually be each observation of each unit rather than the units themselves. Otherwise, you’d need to store a lot of related data (e.g., “temperature”) across many similar columns (temperature_t1, temperature_t2, temperature_t3). If you find yourself needing to do this, it’s a sign you’re probably not in long form yet!
Wide (left) versus long (right) form. If a single observation of interest is a specific tree species at a specific site, then each such observation should have one and only one row, as it does on the right. Notice how, in wide format, we never even explicitly mention that our data are DBH (diameter at breast height) data–our column names would need to be something like “DBH_SiteA”, “DBH_SiteB”…to adequately do that.
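
In R, tidyr’s pivot_longer() handles this rearrangement in one call; here’s a sketch using made-up DBH numbers like those in the figure:

    library(tidyr)
    wide <- data.frame(
      SPECIES   = c("Oak", "Maple"),  # invented values
      DBH_SiteA = c(30.1, 22.4),
      DBH_SiteB = c(28.7, 25.0)
    )
    # One row per species-at-site observation; SITE and DBH become real columns
    long <- pivot_longer(wide,
                         cols         = starts_with("DBH_"),
                         names_to     = "SITE",
                         names_prefix = "DBH_",
                         values_to    = "DBH")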

Why do it:

  1. The biggest reason: Computer programs like R, SPSS, and even Excel (typically) expect your data to be in long form when they go to summarize, operate on, and graph them. Most won’t even understand how to properly work with wide data, so save time by putting them in long-form to begin with!
  2. Long form tends to make sorting, filtering, and exploring your data easier. Using the example in point #2.2 above, if you wanted to filter by temperatures > 10, you’d have to filter 6 different columns to do this if your data are in wide form (and you’d accidentally hide some values you’re trying to show, if one column had a temp < 10 and another > 10!) but only 1 if they are in long form.

Admittedly, it’s true: We squishy and temperamental humans often instinctively gravitate towards the look and readability of wide-form data. However, we are trying to create durable, computer-readable data files, not human-readable ones. As such, long form is better! Besides, in my experience, once you wrap your head around long form, you won’t miss wide form anymore!

Commandment #8: Thou Shall (Probably) Not Leave Cells Empty

Plain English: Missing data are a fact of life in research. Sometimes, data got lost or thrown out. Sometimes, we just didn’t have the ability to gather a more “complete” data set. Also, when we go to form our data rectangle, we introduce more “missing data,” e.g., when we have some data for some observations but not others. How we represent the lack of data in our data sets matters!

What to do:

  1. Identify all cells that are missing data in your data sheet now (usually, but not always, these will be the empty ones in your rectangle). For each one, confirm–should this datum be missing? Is its missingness a “known” or “unknown” issue? Investigate–and if possible, resolve–all “unknown missing data” issues.
  2. Add details to your metadata file explaining all your “missing data issues,” both those understood and those not understood. What, generally speaking, is missing and why (if applicable)?
  3. Represent missing data CAREFULLY.
    1. To a computer, a blank is different than a ” “ [the space from the space bar]. The former would be a “missing value” whereas the latter would be a string [what computers call text, like “apple” or “plot 1.”]. Because it views missing data and strings as two different things, a computer would handle these differently–this can cause big problems down the road!
      1. By the same token, “missing”, “no data”, “who knows”, etc. would all be treated by the computer as text, not as missing values. You should never use text of any kind to mean “missing” when working with a computer!
    2. FALSE, NA, and NULL are special values to a computer. In general, NULL means “this thing doesn’t even exist,” FALSE means “this thing is 0 for math and doesn’t exist otherwise,” and NA means “this thing exists…somewhere…but we don’t have it, sooooo…mostly ignore it I guess?” Hopefully, from this description, it becomes obvious that, 99% of the time, we want missing data cells to be filled with NA–this is the most appropriate way to get a computer to appreciate that a datum is truly “missing.”
      1. BTW…the special value is usually NA, not na or N/a or N/A. For this reason, NA is a TERRIBLE abbreviation for ANY data purpose!!
    3. To a computer, 0 and NA are VERY different. 0 has a mathematical meaning; NA doesn’t. 0 is logically sortable; NA isn’t. The mean of a set of values containing 0 is a value closer to 0; an NA value would be ignored during that process. You should NEVER use 0 (or any numeric value, such as a negative number or 9999 or whatever) to represent a “missing” datum!!
Pic from White et al.’s (2013) article “Nine simple ways to make it easier to (re)use your data.” As the table illustrates, NA is the preferred option for representing missing data. Leaving a cell blank is next-best, but it isn’t as good as NA because it can be hard to tell it apart from spaces (which a computer would read differently) and because it doesn’t look intentional (which can make it look suspicious).
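
You can watch these differences play out directly in R:

    x <- c(2, 4, NA)
    mean(x)                # NA--the missing value propagates...
    mean(x, na.rm = TRUE)  # ...until you explicitly tell R to ignore it: 3
    y <- c(2, 4, 0)
    mean(y)                # 2--the 0 participates in the math, as it should
    # On import, you can tell R which strings should become true NAs:
    dat <- read.csv("my_data.csv", na.strings = c("NA", ""))  # hypothetical file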

Why do it:

  1. Benefit #4 of Good Data Storage: You can put the computer to work for you. If your data are properly stored, organized, and coded, you can have the computer do tedious things (like calculate the standard deviation of 1000 numbers) for you with ease. Missing data can really erode your ability to do this, though–improperly coded missing data can confuse a computer, and sometimes, it can even mean getting the wrong result instead of no result.
  2. Missing data are, by their nature, “suspicion-inducing.” In a truly worst-case scenario, someone could accuse you of “losing” data on purpose! More innocently, you may have made errors that caused some data you really do have to “go missing.” If you don’t “rescue” them, it’d be a waste of the time, effort, and funding used to gather them in the first place! Only by carefully hunting for and resolving missing data issues when those data aren’t missing–as well as by carefully documenting, coding, and explaining truly missing data–can you rest easy and guard yourself against scary prospects like those I’ve raised here.

Commandment #9: Thou Shall Not Abet “Visual” (Meta)Data

Plain English: Legacy formats, like .CSV, do not store any “visual” (meta)data–any info conveyed by comments, highlights, font styling, merged cells, colors, borders, physical location within a spreadsheet, conditional formatting, font size, etc. Info you’re tempted to represent in any of these ways should instead be stored in A) a NOTES/COMMENTS column as part of your “rectangle,” B) in your metadata file, and/or C) in your field/lab notebook.

What to do:

  1. Save a copy of your current spreadsheets as .CSV files (if you haven’t already), then open those .CSV files and examine them and your “regular” data files side by side. You may notice that, if you had any “visual content” in your spreadsheet file, it’s now gone in the .CSV version. [For example, you were marking some problematic records in bold and explaining them using comments.] Is that ok with you? If not, relocate all such information to one of the locations listed above.
  2. In general, avoid “beautifying” your spreadsheet. I know this is me being a party pooper, but a spreadsheet’s job is not to “look pretty”–it’s to store information durably. To the extent these two jobs work against one another, the “looking pretty” job should always lose! Besides, other software, such as R, can be used to make much prettier tables than you can achieve in Excel anyway, if that’s really what you’re after!
Whatever you’re up to here, what with your fancy colors and borders and right alignment, etc., it’s all for naught–none of it’s going to save in your .CSV file! And that’s actually a good thing. It wouldn’t have saved reliably and usefully anyhow…
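
For instance, instead of bolding a suspect value or attaching a comment to its cell, record the issue inside the rectangle itself–a sketch with invented names:

    probe <- read.csv("ProbeReadings_2022_USE.csv")  # hypothetical file
    probe$NOTES <- ""
    # The flag now travels with the data, survives .CSV, and can be filtered on
    probe$NOTES[probe$DATE == "2022-07-14"] <- "Probe acting up; value suspect"
    write.csv(probe, "ProbeReadings_2022_CLEANED.csv", row.names = FALSE)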

Why do it:

  1. Important (meta)data need to not get lost! If it belongs in your data sheet, put it in your data sheet formally (or in your metadata file).
  2. Finding important (meta)data should be predictable–a NOTES/COMMENTS column is a predictable place–an Excel comment is much less so. Formatting, e.g., italics, is even worse. Sure, the problematic values may be marked, but the explanation for what an italic value means could be who knows where (even in a different file altogether)!
  3. What is interpretable, pretty, or even readable is very subjective and personal. Here’s an extreme but telling example: Red may indicate “problematic records” effectively for you, but to someone who is color-blind, it means nothing. Another example: The little icon that indicates a cell has a comment is really small in Excel and Google Sheets–it’d be easy to not notice it at all, especially for older individuals or those with poorer eyesight. And the combined effects of multiple colors, shades, styles, borders, fill colors, etc. colliding in a single document full of numbers might make an outside viewer feel overwhelmed rather than informed. Simpler. Is. Better.

Commandment #10: Thou Shall Not Permit Two Distinct Data Types to Lie Together

Plain English: We want our data to be easily searchable, sortable, filterable, and summarizable. To truly achieve this goal, we can’t store unalike data together in the same column/cell, even if the data feel logically related to us. For example, if you are gathering both the sex and weight of individuals, store weight in a weight column and sex in a sex column rather than having them combined in one column like: [9Male, 12Female, 6Female, 2Male…].

No, not that kind of “sex column!” Sheesh…

What to do:

  1. For every column you begin to create, ask yourself: “Can I foresee a need to put both text and numbers (or sometimes one and sometimes the other) in the cells in this column?” If so, you probably need two (or more) columns–numbers and text very rarely belong together in the same column/cell.
    1. For every column you already have in which numbers and text are comingling, try to find a way to separate them into multiple columns.
  2. For every column you already have, ask yourself: “Is there any way I could break this column into more than one?” Imagine you have an experiment that has a control group and a treatment group, each applied to two sexes (male and female), for a 2×2 design. Rather than have TREATMENTMALE, TREATMENTFEMALE, CONTROLMALE, and CONTROLFEMALE as values in a single column, split these into two columns: SEX [MALE or FEMALE] and GROUP [TREATMENT or CONTROL].
    1. In particular, dates and times are interesting opportunities for “splitting,” even if you also retain the original columns. For example, consider splitting a date into “MONTH,” “DAY,” and “YEAR” columns (assuming your data span more than one of each); see the sketch below this list. FWIW, dates have a notorious habit of being very annoying to store and later use successfully, mostly stemming from them having a slew of viable formats that are often language-/country-/time-zone-specific. Decomposing them into DAY, MONTH, and YEAR can just be practical!
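
Here’s a sketch of both kinds of splitting in R–tidyr’s separate() for the comingled column from earlier, plus date decomposition (all values invented):

    library(tidyr)
    df <- data.frame(WEIGHT_SEX = c("9Male", "12Female", "6Female", "2Male"),
                     DATE = as.Date(c("2022-06-19", "2022-06-20",
                                      "2022-07-01", "2022-07-02")))
    # Split the combined column at the number-to-letter boundary
    df <- separate(df, WEIGHT_SEX, into = c("WEIGHT", "SEX"),
                   sep = "(?<=[0-9])(?=[A-Za-z])", convert = TRUE)
    # Decompose the date (keeping the original DATE column too)
    df$YEAR  <- as.integer(format(df$DATE, "%Y"))
    df$MONTH <- as.integer(format(df$DATE, "%m"))
    df$DAY   <- as.integer(format(df$DATE, "%d"))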

Why do it:

  1. Numbers can participate in math; letters can’t. Storing letters with numbers shuts down any math you’d like the computer to do for you on those numbers!
  2. A common reason for numbers and letters to get stored together is units, e.g., 49KG. However, a _KG tag can be added to the column name, if all values in that column share that unit, and no information is lost. Alternatively, if the units change from value to value in a column [oof!], you can just have a second column (with a _UNITS tag) to indicate the units, leaving the numbers in the numerical column to be plain old numbers.
  3. As I said above, a goal to always keep in mind when organizing your data is to make it maximally sortable and filterable. What if you decided you wanted to see if weight was cycling from low to high to low again within each month across your organisms? If you have just a DATE column, this could be tricky–if you also had a DAY column, it would be easier. What if you decided you wanted to get the mean weight of all control males weighed in kilograms over 30 years old? This may require a pretty complex filtering process if all this info is stuffed into one or two columns but easy if AGE, WEIGHT, UNITS, GROUP, and SEX are all separate columns.

There you have it! Follow these 10 Rules and you’ll be heralded as a true disciple of Sound Data Storage! Please leave questions and comments–I am excited to hear your thoughts and perspectives!
