Jun 13 2020

Name That Algorithm #4

One project that I have been working on recently is gathering property tax information. While the reason isn’t particularly interesting, what is interesting is my recent optimization. Previously, I had been archiving the raw html files. To get the information in a human readable CSV format, I ran a Node.js parser which utilized cheerio. I was only concerned with a couple of the fields, so I just plucked those out and plopped them in a CSV.

After about 6 months of using this tool, the pain points started to set in. Processing an html file in Node and cheerio ended up taking roughly a half second per file. When you have approximately 40,000 files, this takes a long time – over five hours.

There are two low hanging fruit to attack:

Cheerio claims that it is a blazing fast DOM. However, in my case, I didn’t need an entire DOM. I just want to linearly parse my data. Manipulating and rendering is not needed. I had a feeling that the extra bells and whistles held some overhead.

It is common in coding interviews to have data preprocessing as part of a solution to a whiteboard problem. With this being a common in big-data, I knew this to be a solution to my issue

Let’s name the algorithm for processing this data. Instead of showing an algorithm that requires optimizing, I am going to show some data that needs to be processed.

Consider some tabular data of tax payments on a property. In my case this was an HTML table:

| Status | Payment Date | Type   | Amount | Receipt Number |
| Paid   | 01/01/2020   | A      | $1000  | 1234           |
| Paid   | 02/01/2020   | A      | $1000  | 1235           |
| Paid   | 03/01/2020   | A      | $1000  | 1236           |

The tool I chose for the job is Pythons built-in html.parser. It is fairly easy to use the handle_data override to extract the above into an array. I knew that this was going to by snappy since this underlying parser is compiled into bytecode.

  'Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number',

Now that Python has processed my data into a flat array, we need to label this information, similar to the table above. More specifically, I want something like:

    'Status': 'Paid',
    'Payment Date': '01/01/2020',
    'Type': 'A',
    'Amount': '$1000',
    'Receipt Number': '1234'

What algorithm is this? Scroll down for the answer.

I felt it makes sense to use chunk. Since it is guaranteed that empty cells (<td></td>) parse as empty string, we don’t need to worry about offset issues. The result of chunk yields:

  ['Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number'],

Now it’s pretty trivial to get this in the desired object format. Chop off the 0th index and then access each sub-index per attribute. You can even use the invokeMap algorithm if you want to get fancy. Happy coding!