Summary
Recently a non-programmer co-worker was giving me a hard time about not liking spreadsheets (Microsoft Excel or Google Sheets). He thinks it's the funniest thing that:I HATE SPREADSHEETS!
If it's not obvious this is a rant with lots of explanation, there is much to be learned for new programmers and hopefully a few users can understand spreadsheets a little better.
There are no curse words but the tone is negative and/or cynical.
Background
First a little background, I've been a professional business application programmer for 20 years. In that time I've worked with a lot spreadsheets, everything from helping users create formula's to maintaining a VBA spreadsheet that runs 100's of queries against a MySQL database, then applies formulas to the results. I've also created a Google Sheet that does a mail merge and is used to send over 5,000 (non-SPAM) emails per day.I've seen data loss, data corruption and tried to explain to users why these things happened. I understand the limits of a spreadsheet and have pushed those limits, it's not pure hate for spreadsheets it's about using them to solve the right problem.
How a programmer sees a spreadsheet
The most common request I receive is to "Import a spreadsheet into a database". This seems simple and shouldn't be hard because every program has this feature and the user has worked really hard on the spreadsheet and it has all the right fields and data for their task and it already works.Process Questions:
- One time or scheduled import?
- Expected number of records?
- How many people edit the file?
- What is the source of the data?
- How do they change the data from the original?
Data Quality Questions:
- What is the actual file format? XLS, XLSX, CSV, Other?
- How many columns, rows?
- How many sheets?
- Are there column headers?
- Same number of column headers as columns?
- Does the data format look consistent?
- How should errors be handled?
With these questions answered I can usually answer yes or no, if I want to take on the task. If I have any questions about the task at this point the default answer is NO. Why? If my program fails because the user changed the spreadsheet it's my program that looks bad, and I lose trust with the users.
User enters bad data = Program Crashes = Decreased trust of users
Real Situations I've seen regarding spreadsheets:
- User made the spreadsheet prettier, and broke almost all of the code, because the code depended on certain data in certain places. The user actually performed the task manually for months before reporting the problem.
- 15+ digit numbers are truncated and get a zero at the end (Known Issue)
- Everything more than 65,536 Rows is deleted in XLS (Known Issue)
More Technical examples:
- I've received a CSV file with an XML formatted SOAP response, that was then parsed with Excel sub-string formulas that worked only on test data. It broke quickly when used on real data.
- CSV files that don't quote strings properly so using quote marks doesn't work ex. "60" TV" similar situation with peoples names with an apostrophe like O'Reilly.
Databases aka Structured Data
tl;dr Database data is HIGHLY structured, which improves data quality.
Databases are a big topic and the only real thing to understand for this post is that they provide a very structured way to store data. Including tables, rows, columns, data types, as well as rules regarding unique values and the relationships between sheets/tables.
For example MySQL has 5 different data types to hold a whole number depending on the minimum and/or maximum value that will be stored. Whole numbers don't have any decimal places and can be positive or negative. There are other datatypes to handle everything else.
It sounds easy, simply choose the largest size, but choosing the min/max size of data can improve the quality of the data in your database.
For example to store a persons age you shouldn't allow a negative number, but allowing a value greater than 1,000 is not valid. So an unsigned "TINYINT" can hold any value from 0 to 255, still a little big but it's the best match.
Now lets look at a spreadsheet like a programmer
I've recreated the spreadsheet my co-worker wanted me to import with some dummy data.
Any programmers following along are already crying, here is why...
- 3 sets of column headings in the first 20 rows, note mixed case F3 and F11
- This suggests that it's manually maintained
- Columns could have misspellings that are fixed in some places and not others
- How does my program know that row 11 is another column header and not another row of data?
- Black background rows may hide data from users, but doesn't hide it from a programmer.
- Colors wont be included in a CSV export.
- The City Name is 8 cells merged into 1. What row/column would it be in a CSV?
- Request Field is also merged
- I'd have to convert the x columns into a True/False column
- Column 6 appears to be a phone number
- Will it always be formatted as a raw number?
- How many different ways can a phone number be formatted?
- Should I store it as a number or a string?
- Column 7 is a name with both first and last in the same cell
- Do I store this as one or two database columns?
- What about middle names? Middle initials?
- Duplicates?
- F5 and F20 are identical, normal or typo?
- Duplicates in the email address, normal or typo?
- The "extra" heading in Column H has both a text name and a date
- Where should I store the extra header data? Same table? Another table?
- In the original spreadsheet some cells had comments. Ignore? Same table? Another table?
This spreadsheet didn't have any number examples where sometimes there is a money symbol (USD, EUR, etc), or a degree symbol or sometimes decimal sometimes a fraction. Basically things that a person can interpret easily while a program would have to test for each of these special cases.
This is the time to mention that any cell can truly accept any type of data. While there may be "conventions" used in the spreadsheet, if the user doesn't follow them the spreadsheet doesn't care. Many spreadsheets can actually have images inside of cells.
As expected the monthly tabs were very similar to the example above, while the other tabs all had different formats even from each other.

