Jun 13 2020

Name That Algorithm #4

One project that I have been working on recently is gathering property tax information. While the reason isn’t particularly interesting, what is interesting is my recent optimization. Previously, I had been archiving the raw html files. To get the information in a human readable CSV format, I ran a Node.js parser which utilized cheerio. I was only concerned with a couple of the fields, so I just plucked those out and plopped them in a CSV.

After about 6 months of using this tool, the pain points started to set in. Processing an html file in Node and cheerio ended up taking roughly a half second per file. When you have approximately 40,000 files, this takes a long time – over five hours.

There are two low hanging fruit to attack:

Cheerio claims that it is a blazing fast DOM. However, in my case, I didn’t need an entire DOM. I just want to linearly parse my data. Manipulating and rendering is not needed. I had a feeling that the extra bells and whistles held some overhead.

It is common in coding interviews to have data preprocessing as part of a solution to a whiteboard problem. With this being a common in big-data, I knew this to be a solution to my issue

Let’s name the algorithm for processing this data. Instead of showing an algorithm that requires optimizing, I am going to show some data that needs to be processed.

Consider some tabular data of tax payments on a property. In my case this was an HTML table:

| Status | Payment Date | Type   | Amount | Receipt Number |
| Paid   | 01/01/2020   | A      | $1000  | 1234           |
| Paid   | 02/01/2020   | A      | $1000  | 1235           |
| Paid   | 03/01/2020   | A      | $1000  | 1236           |

The tool I chose for the job is Pythons built-in html.parser. It is fairly easy to use the handle_data override to extract the above into an array. I knew that this was going to by snappy since this underlying parser is compiled into bytecode.

[
  'Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number',
  'Paid','01/01/2020','A','$1000','1234',
  'Paid','02/01/2020','A','$1000','1235',
  'Paid','03/01/2020','A','$1000','1236'
]

Now that Python has processed my data into a flat array, we need to label this information, similar to the table above. More specifically, I want something like:

[
  {
    'Status': 'Paid',
    'Payment Date': '01/01/2020',
    'Type': 'A',
    'Amount': '$1000',
    'Receipt Number': '1234'
  },
  ...
]

What algorithm is this? Scroll down for the answer.







I felt it makes sense to use chunk. Since it is guaranteed that empty cells (<td></td>) parse as empty string, we don’t need to worry about offset issues. The result of chunk yields:

[
  ['Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number'],
  ['Paid','01/01/2020','A','$1000','1234'],
  ['Paid','02/01/2020','A','$1000','1235'],
  ['Paid','03/01/2020','A','$1000','1236']
]

Now it’s pretty trivial to get this in the desired object format. Chop off the 0th index and then access each sub-index per attribute. You can even use the invokeMap algorithm if you want to get fancy. Happy coding!