Monday, September 14, 2020

How a programmer sees a spreadsheet

Summary

Recently a non-programmer co-worker was giving me a hard time about not liking spreadsheets (Microsoft Excel or Google Sheets). He thinks it's the funniest thing that:
I HATE SPREADSHEETS!

If it's not obvious this is a rant with lots of explanation, there is much to be learned for new programmers and hopefully a few users can understand spreadsheets a little better.

There are no curse words but the tone is negative and/or cynical.

Background

First a little background, I've been a professional business application programmer for 20 years. In that time I've worked with a lot spreadsheets, everything from helping users create formula's to maintaining a VBA spreadsheet that runs 100's of queries against a MySQL database, then applies formulas to the results. I've also created a Google Sheet that does a mail merge and is used to send over 5,000 (non-SPAM) emails per day.

I've seen data loss, data corruption and tried to explain to users why these things happened. I understand the limits of a spreadsheet and have pushed those limits, it's not pure hate for spreadsheets it's about using them to solve the right problem.

How a programmer sees a spreadsheet

The most common request I receive is to "Import a spreadsheet into a database". This seems simple and shouldn't be hard because every program has this feature and the user has worked really hard on the spreadsheet and it has all the right fields and data for their task and it already works.

Process Questions:

  • One time or scheduled import?
  • Expected number of records?
  • How many people edit the file?
  • What is the source of the data?
  • How do they change the data from the original?
Even once these questions are answered, it opens up so many more questions I've found it's easier to just ask for a copy of the spreadsheet so I can look at it myself.

Data Quality Questions:

  • What is the actual file format? XLS, XLSX, CSV, Other?
  • How many columns, rows?
  • How many sheets?
  • Are there column headers?
  • Same number of column headers as columns?
  • Does the data format look consistent?
  • How should errors be handled?
With these questions answered I can usually answer yes or no, if I want to take on the task. If I have any questions about the task at this point the default answer is NO. Why? If my program fails because the user changed the spreadsheet it's my program that looks bad, and I lose trust with the users.

User enters bad data = Program Crashes = Decreased trust of users

Real Situations I've seen regarding spreadsheets:

  • User made the spreadsheet prettier, and broke almost all of the code, because the code depended on certain data in certain places. The user actually performed the task manually for months before reporting the problem.
  • 15+ digit numbers are truncated and get a zero at the end (Known Issue)
  • Everything more than 65,536 Rows is deleted in XLS (Known Issue)

More Technical examples:

  • I've received a CSV file with an XML formatted SOAP response, that was then parsed with Excel sub-string formulas that worked only on test data. It broke quickly when used on real data.
  • CSV files that don't quote strings properly so using quote marks doesn't work ex. "60" TV" similar situation with peoples names with an apostrophe like O'Reilly.

Databases aka Structured Data

tl;dr Database data is HIGHLY structured, which improves data quality. 

Databases are a big topic and the only real thing to understand for this post is that they provide a very structured way to store data. Including tables, rows, columns, data types, as well as rules regarding unique values and the relationships between sheets/tables.

For example MySQL has 5 different data types to hold a whole number depending on the minimum and/or maximum value that will be stored. Whole numbers don't have any decimal places and can be positive or negative. There are other datatypes to handle everything else.

It sounds easy, simply choose the largest size, but choosing the min/max size of data can improve the quality of the data in your database. 

For example to store a persons age you shouldn't allow a negative number, but allowing a value greater than 1,000 is not valid. So an unsigned "TINYINT" can hold any value from 0 to 255, still a little big but it's the best match. 

Now lets look at a spreadsheet like a programmer

I've recreated the spreadsheet my co-worker wanted me to import with some dummy data.


Any programmers following along are already crying, here is why...
  • 3 sets of column headings in the first 20 rows, note mixed case F3 and F11
    • This suggests that it's manually maintained 
    • Columns could have misspellings that are fixed in some places and not others
    • How does my program know that row 11 is another column header and not another row of data?
  • Black background rows may hide data from users, but doesn't hide it from a programmer.
  • Colors wont be included in a CSV export.
  • The City Name is 8 cells merged into 1. What row/column would it be in a CSV?
    • Request Field is also merged
  • I'd have to convert the x columns into a True/False column
  • Column 6 appears to be a phone number
    • Will it always be formatted as a raw number?
    • How many different ways can a phone number be formatted?
    • Should I store it as a number or a string?
  • Column 7 is a name with both first and last in the same cell
    • Do I store this as one or two database columns?
    • What about middle names? Middle initials?
  • Duplicates?
    • F5 and F20 are identical, normal or typo?
    • Duplicates in the email address, normal or typo?
  • The "extra" heading in Column H has both a text name and a date
  • Where should I store the extra header data? Same table? Another table?
  • In the original spreadsheet some cells had comments. Ignore? Same table? Another table?
This spreadsheet didn't have any number examples where sometimes there is a money symbol (USD, EUR, etc), or a degree symbol or sometimes decimal sometimes a fraction. Basically things that a person can interpret easily while a program would have to test for each of these special cases.

This is the time to mention that any cell can truly accept any type of data. While there may be "conventions" used in the spreadsheet, if the user doesn't follow them the spreadsheet doesn't care. Many spreadsheets can actually have images inside of cells. 

In addition to everything noted here was the Tab names, I can't make this any worse by making it up, edited minimally for privacy: ch, ACS_Import, FY, January, February, March, April, May, June, July, July Special Circumstance, August, September, October, November, December, Sheet 15. This is only to mention that the tab names are not consistent and the ones selected for import would have to be specified in the import program.

As expected the monthly tabs were very similar to the example above, while the other tabs all had different formats even from each other.


MySQL Enable Remote Connections

Introduction

    This is something I've struggled with for about 5-6 hours now, all of the "helpful" websites have magic incantation command lines that are supposed to fix the problem, if they don't work you don't actually have any more knowledge to help you, so you just go looking for more magic to try.  Here I will attempt to improve this with a troubleshooting guide with some more details behind the commands used.

Possible Problems with Connecting to MySQL from another computer
  • Network Hardware Failure
  • MySQL Bind to an IP Address
  • MySQL User Permissions
  • Network Firewall (Not Complete)
I'm going to assume that the Network Hardware is working, which means that you can open a terminal from another computer. 

My configuration is an ODROID HC1 running Linux Ubuntu and MariaDB, I want to connect from my  Win10 Desktop with HeidiSQL. I'm using Putty to open a bash terminal to the HC1. Even if you are a Windows user I have to assume some knowledge with a navigating with a command line, primarily the 'ls' to list a directories contents and 'cd' to change to another directory, these commands should be enough. Optionally the 'mv' may be used to either move or rename a file. 

My main error is "Can't connect to MySQL server on <IP ADDRESS> (10061) 

I'm also assuming the MySQL (MariaDB) is already installed, but you can't connect remotely. You may of tried a bunch of things and have a very messed up environment and that is OK.

Troubleshooting tools to install

telnet client on both Ubuntu and Windows
  • Ubuntu
    • apt-get install telnet
  • Windows 10
    • Start button
    • Type "Turn Windows" (Features On or Off)
    • Check the box next to "Telnet Client"
    • OK
    • Restart
First steps are to make sure that it works on the Linux computer.
  • Login to the remote Linux server with any terminal program (Win10 use Putty)
  • >>mysql -u root -p
    • password is blank by default, but if you set one use it.
    • The default "host" is 'localhost' which is a unix socket
  • SQL>select user, host, password from mysql.user;
    • Don't forget the semi-colon at the end
    • The password is hashed/encrypted but you can tell if it's set and if it's the same as another user.
  • SQL>quit
At this point we have proven that the mysql server is installed and functioning and has at least a root user with localhost access. Now here is a tricky part mysql actually uses a unix socket to connect the client to the server, so there isn't any networking actually involved
  • If any of the above steps failed maybe the mysql installation needs to be done/re-done
The easiest way to test if MySQL can be accessed outside of the local computer is with the telnet command. The following commands can be re-run at any stage of this article.
  • telnet localhost 3306
    • This uses the Unix Socket so it will work but doesn't prove anything. Although it does show you what you should see when you run the following command
  • telnet <ip address> 3306
    • If you see the same kind of thing you saw from the localhost version above, you can move on to the Network Firewall issues, or MySQL user permission issues
The next steps are to determine the configuration loaded by MySQL and where the values actually come from.
  • >>mysqld --print-defaults
    • One of issues was the bind-address was duplicated (see bold settings) the last one that is set is the value that is used. These all look like command line parameters, but they may come from a command line or from internal default values or from one of multiple configuration files. The challenge is figuring out which one.
    • --bind-address=0.0.0.0 
    • --character-set-server=utf8mb4 
    • --collation-server=utf8mb4_unicode_ci 
    • --character-set_server=utf8mb4 
    • --collation_server=utf8mb4_unicode_ci 
    • --user=mysql 
    • --pid-file=/var/run/mysqld/mysqld.pid 
    • --socket=/var/run/mysqld/mysqld.sock 
    • --port=3306 
    • --basedir=/usr 
    • --datadir=/var/lib/mysql 
    • --tmpdir=/tmp 
    • --lc-messages-dir=/usr/share/mysql 
    • --skip-external-locking 
    • --bind-address=127.0.0.1 
    • --key_buffer_size=16M 
    • --max_allowed_packet=16M 
    • --thread_stack=192K 
    • --thread_cache_size=8 
    • --myisam-recover=BACKUP 
    • --query_cache_limit=1M 
    • --query_cache_size=16M 
    • --log_error=/var/log/mysql/error.log 
    • --expire_logs_days=10 
    • --max_binlog_size=100M 
    • --character-set-server=utf8mb4 
    • --collation-server=utf8mb4_general_ci
  • >>ps -ef | grep 'mysql'
    • This command will show the values that were actually passed on the command line.
  • >>mysqld --verbose --help | grep -A 1 "Default options"
    • This will list the location of all configuration files that were read when mysql started.
    • NOTE: In many cases (all?) one of the config files may "includedir" configuration files from other locations.
      • The configuration files for my system were in this order
      • /etc/my.cnf
      • /etc/mysql/my.cnf
      • /etc/mysql/mariadb.cnf
      • /etc/mysql/conf.d/*.cnf
      • /etc/mysql/mariadb.conf.d/*.cnf
      • ~/.my.cnf
Now review the list from print-defaults and figure out where each value comes from. Reminder: use 'sudo' to elevate your permissions in order to edit these text files, 'vi' is a good editor if you are familiar with it. If you are usually a Windows user or otherwise aren't sure 'nano' is a little easier to use. 
Examples:
  • sudo vi /etc/my.cnf
    • Look up a cheatsheet if you don't know the commands
  • OR
  • sudo nano /etc/my.cnf
    • Ctrl+O and <Enter> to save the change
    • Ctrl+X to exit nano
Here are my results:
/etc/my.cnf --bind-address=0.0.0.0 
/etc/mysql/conf.d/my.cnf --character-set-server=utf8mb4 
/etc/mysql/conf.d/my.cnf --collation-server=utf8mb4_unicode_ci 
/etc/mysql/conf.d/my.cnf --character-set_server=utf8mb4 
/etc/mysql/conf.d/my.cnf --collation_server=utf8mb4_unicode_ci 
command line --user=mysql 
command line --pid-file=/var/run/mysqld/mysqld.pid 
command line --socket=/var/run/mysqld/mysqld.sock 
command line --port=3306 
command line --basedir=/usr 
command line --datadir=/var/lib/mysql 
/etc/mysql/mariadb.conf.d/50-server.cnf --tmpdir=/tmp 
/etc/mysql/mariadb.conf.d/50-server.cnf --lc-messages-dir=/usr/share/mysql 
/etc/mysql/mariadb.conf.d/50-server.cnf --skip-external-locking 
/etc/mysql/mariadb.conf.d/50-server.cnf --bind-address=127.0.0.1 
/etc/mysql/mariadb.conf.d/50-server.cnf --key_buffer_size=16M 
/etc/mysql/mariadb.conf.d/50-server.cnf --max_allowed_packet=16M 
/etc/mysql/mariadb.conf.d/50-server.cnf --thread_stack=192K 
/etc/mysql/mariadb.conf.d/50-server.cnf --thread_cache_size=8 
/etc/mysql/mariadb.conf.d/50-server.cnf --myisam-recover=BACKUP 
/etc/mysql/mariadb.conf.d/50-server.cnf --query_cache_limit=1M 
/etc/mysql/mariadb.conf.d/50-server.cnf --query_cache_size=16M 
/etc/mysql/mariadb.conf.d/50-server.cnf --log_error=/var/log/mysql/error.log 
/etc/mysql/mariadb.conf.d/50-server.cnf --expire_logs_days=10 
/etc/mysql/mariadb.conf.d/50-server.cnf --max_binlog_size=100M 
/etc/mysql/conf.d/my.cnf --character-set-server=utf8mb4 
/etc/mysql/conf.d/my.cnf --collation-server=utf8mb4_general_ci
If you determine you need to edit any of these files use the editor of your choice from above and if you aren't sure about the change you are making I suggest commenting out the line with a '#' as the first character and retyping the line that you want.

Once you modify a configuration files that you think might fix your problem do the following:
  • >>sudo service mysql restart
    • I'm using Ubuntu so your syntax for this may be different
  • >>telnet <ip address> 3306
    • Error or some text that includes "Trying" and the database name "MariaDB"
  • >>mysql -u root -h <ip address> -p
    • I used the blank password that I was using for localhost previously and now this command changed to a new error message for me:
      • Access denied for user 'root'@'<ip address>' (using password: NO)
    • if root has a password assigned for '%' (see the query we ran near the beginning) use that password to try and connect.
NOTE: I was able to connect remotely at this point, so the following isn't as well tested. 

At this point we have a "permission" error. If you already set a password for root for remote access you can use that. Although best practice would be to create a new mysql user with remote access permissions. 
  • >>mysql -u root -p
    • Connect to the mysql server locally to add a new user
  • SQL>create user '<username>'@'%' identified by '<password>';
    • Even better is to use an IP address or range instead of the % character above
  • SQL>GRANT ALL PRIVILEGES ON *.* TO '<username>'@'%';
    • Again if you specified an IP address or range use that instead of the %
  • OPTIONAL SQL>grant grant option on *.* to '<username>'@'%';
    • This is so that the user can grant other permissions to users like 'root'
  • SQL>flush privileges;
I don't know enough about firewall's to really guide someone and my problem is solved and I feel that this improves the available information on this error.

I hope your day is better because of this post.