As a data science librarian at the University of California, San Diego, Stephanie Labou has seen her share of spreadsheet horror stories. Most haunting was a table of hand-entered GPS coordinates.
“It was a complete mix,” recalls Labou. The table was created by citizen scientists. Some had written the word “Grade”, some “0” and some “o”; some had used superscripts, others not; others wrote “North,” “West,” or neither. “We’re talking about tens of thousands of rows of data with each latitude and longitude annotated differently,” she says. “It was the most inconsistent chart I’ve ever seen.”
Data scientists express strong feelings about using spreadsheets for data analysis. Overall, they prefer programming languages like R and Python, in which analyzes are easier to document and more reproducible. But many researchers are more comfortable with spreadsheets, and being ashamed of using them is counterproductive, says Labou. Sometimes spreadsheets are the quickest way to solve a problem. And there really is no other way to capture tabular data.
Spreadsheets are reactive: cells that depend on other cells are automatically updated when the data changes. They can also be helpful by intelligently formatting data to make it easier to read. Besides, they are everywhere. Spreadsheets are “where data science begins,” says Tracy Teal, open source program director at software developer RStudio in Boston, Massachusetts.
But they’re also trickier than they seem. For example, a function that calculates the average of a column will return the wrong value if the formula doesn’t take into account the correct range of data. Cells that appear blank may not be. And automatic formatting doesn’t always work as expected. Researchers have long known that some genomic studies contain garbled data because Excel incorrectly converted some gene symbols, such as OCT4, in dates. An analysis of around 11,100 articles published between 2014 and 2020 revealed that 31% still contain such errors (M. Abeysooriya et al. PLoS calculation. biol. 17, e1008984; 2021).
As data scientists Karl Broman of the University of Wisconsin-Madison and Kara Woo, then at the University of Washington, Seattle, wrote in 2018: “Spreadsheets, for all their mundane squareness, have been the subject of fear and controversy for decades.” (KW Broman and KH Woo Am. Stat. 72, 2–10; 2018).
Here are six tips for proper use.
Keep raw data raw
Christie Bahlai, a computational ecologist at Kent State University in Ohio, has helped create workshops and teaches spreadsheet best practices courses for ecologists. She says her top piece of advice is to “keep your raw data raw.”
Spreadsheets, Bahlai says, are “tactile”: They’re user-friendly, intuitive, and easy to manipulate. But they’re also “easy to mess up,” and it’s “easy to lose track of what you’ve done.” A wrong mouse click can result in data ending up in the wrong place. And the auto format feature can ruin the data. In addition, the table may contain organizational information that may not be immediately clear. Because of this, Bahlai recommends users make their original spreadsheet a read-only document and work on copies so they can start over if needed.
Bahlai recalls a case where she kept finding single letters in one of the table’s columns as she began processing the data. “I’m like, ‘What does ‘M’ mean? What does “A” mean?” It turns out that a team member had typed “NO SAMPLE” vertically in one of the columns, one letter per line — an organizational decision clear to a human reader but not a computer. As she sorted the table, that visual organization was lost. “It was like sorting out a mess,” she says, laughing. “I realized, ‘Oh, that means something, there’s a message!'”
Make data machine-readable
Spreadsheets offer extensive formatting options, from font design to background fills to borders. This digital “bling” can liven up a table and make it more readable. But when researchers use such styling to encode data, they can run into trouble.
“My top piece [of advice] is not to encode data with color or formatting, but to create another column that can be sorted or filtered,” says Mine Çetinkaya-Rundel, a statistician at Duke University in Durham, North Carolina.
That’s because cell formatting is difficult for downstream users to grasp. “All the tools that data scientists have at their disposal don’t know data that is expressed as formatting rather than text or numeric values,” says Duncan Garmonsway, data scientist at the UK Government Digital Service in Lincoln. Formatting can be lost during routine table manipulations. And researchers might have trouble remembering what the formatting represents when they return to the spreadsheet months or years later.
Luis Verde Arregoitia, a mammalian researcher at the Institute of Ecology (INECOL) in Veracruz, Mexico, had this experience while revisiting an ancient collection of biodiversity records. He had rows highlighted in yellow, orange, or green to show his confidence in the data. “At this point,” he says, “I can’t really remember the exact color-coding scheme I used.”
Data analysis tools expect spreadsheets to be in a specific format: one row with column headings, no merged cells, and one table per page. Ideally, all cells are also filled without data (eg with ‘NA’) and contain exactly one date. For example, to tabulate data from a field study to count insects, use separate columns for insect species and for the census, says Teal, rather than say, “3 red bugs.”
Specialized tools can untangle spreadsheets that deviate from the ideal. For example, Verde Arregoitia’s “unheadr” package processes tables containing rows to subdivide a table into different groups, which he calls “embedded subheadings”. Garmonsway’s Tidyxl and RStudio’s data scientist Jenny Bryan’s googlesheets4 provide ways to extract the formatting.
The key, Labou says, is consistency — decide on an approach, document it, and stick with it. How are species specified? And how should dates be formatted – does 1/2/2022 mean February 1st or January 2nd? Most experts recommend either the YYYY-MM-DD format – the standard of the International Organization for Standardization – or separate columns for year, month and day. Combined with data validation, using separate columns means “there is absolutely no ambiguity,” says Labou. But, Broman warns, it makes calculating date differences harder.
Document your work
While programming scripts can be saved and versioned, keystrokes and mouse clicks are generally not possible. However, spreadsheet users can still document their analysis.
Designate a table (or tab) as a “code book” that documents abbreviations, how data was collected, units of measure, how missing values are represented, the calculations performed, and any metadata needed to understand, process, and maintain the table . “It’s important to write a roadmap for yourself,” says Çetinkaya-Rundel.
Then, Bahlai says, “write the recipe of what you did with your data.” What does each formula do and where does it get its data from? “You’ll regret it when you’re writing your methods and you’re like, ‘Huh, how did I average this?'” she says. (In Excel, you can use the “Audit” function to see the flow of data through the formulas, notes Felienne Hermans, a computer scientist at Leiden University in the Netherlands.)
Check your data
Data analysts often add cross-checks to ensure their data processing code is working as expected. Spreadsheet users can do something similar, says Hermans.
For example, in a study with samples from both cases and controls, the total number of values in the two groups should always equal the number of samples; Last but not least, this cross-check ensures that cells you think are empty are actually empty. “Adding some of those cross-checks so you can see that everything’s okay is actually a really, really good idea,” she says.
You can also “protect” parts of the table from changes and apply data validation to ensure that date columns contain valid data, that numbers fall within specific ranges, or that text fields contain expected terms. Alternatively, Çetinkaya-Rundel suggests using a data entry form (e.g. a Google form) instead of editing the spreadsheet directly. This allows values to be validated as you type, and users cannot accidentally change the document. Finally, Teal says, double-check your work. Data analysis is often iterative, she notes. “You don’t just walk in the door and say, ‘I’m going to do this equation,’ sit down, do it, done.” So once you’ve decided on a workflow, reset it and start over, says them, and just make sure you have the answer you were thinking.
The good news is that data scientists can generally edit spreadsheets regardless of their format. “A key principle I have as a data analyst is that if someone asks me what form I want the data in, I always say ‘as is,'” says Broman. “If the data needs to be reorganized or transformed in some way, I’m always in the best position to do it.” But it’s better, says Labou, to figure out what you want to do with your data before you even start your spreadsheet create. What variables and covariates will you use? Which time steps do you need? What analyzes will you carry out? “Thinking about it ahead of time is one of the best things people can do,” she says.
And consult with your employees, Garmonsway adds. Rules for data organization “are not set in stone anywhere,” he says. “Physicists did not discover them in the fundamental laws of the universe. They came about because it is difficult to work with other people. So if you collaborate in creating your spreadsheet, it’s much more likely to be useful to others because it’s already useful to someone other than yourself.”