Jun 13 2020

Name That Algorithm #4

One project that I have been working on recently is gathering property tax information. While the reason isn’t particularly interesting, what is interesting is my recent optimization. Previously, I had been archiving the raw html files. To get the information in a human readable CSV format, I ran a Node.js parser which utilized cheerio. I was only concerned with a couple of the fields, so I just plucked those out and plopped them in a CSV.

After about 6 months of using this tool, the pain points started to set in. Processing an html file in Node and cheerio ended up taking roughly a half second per file. When you have approximately 40,000 files, this takes a long time – over five hours.

There are two low hanging fruit to attack:

Cheerio claims that it is a blazing fast DOM. However, in my case, I didn’t need an entire DOM. I just want to linearly parse my data. Manipulating and rendering is not needed. I had a feeling that the extra bells and whistles held some overhead.

It is common in coding interviews to have data preprocessing as part of a solution to a whiteboard problem. With this being a common in big-data, I knew this to be a solution to my issue

Let’s name the algorithm for processing this data. Instead of showing an algorithm that requires optimizing, I am going to show some data that needs to be processed.

Consider some tabular data of tax payments on a property. In my case this was an HTML table:

The tool I chose for the job is Pythons built-in html.parser. It is fairly easy to use the handle_data override to extract the above into an array. I knew that this was going to by snappy since this underlying parser is compiled into bytecode.

Now that Python has processed my data into a flat array, we need to label this information, similar to the table above. More specifically, I want something like:

What algorithm is this? Scroll down for the answer.

I felt it makes sense to use chunk. Since it is guaranteed that empty cells (<td></td>) parse as empty string, we don’t need to worry about offset issues. The result of chunk yields:

Now it’s pretty trivial to get this in the desired object format. Chop off the 0th index and then access each sub-index per attribute. You can even use the invokeMap algorithm if you want to get fancy. Happy coding!