Jun 13 2020
Name That Algorithm #4
One project that I have been working on recently is gathering property tax information. While the reason isn’t particularly interesting, what is interesting is my recent optimization. Previously, I had been archiving the raw html files. To get the information in a human readable CSV format, I ran a Node.js parser which utilized cheerio. I was only concerned with a couple of the fields, so I just plucked those out and plopped them in a CSV.
After about 6 months of using this tool, the pain points started to set in. Processing an html file in Node and cheerio ended up taking roughly a half second per file. When you have approximately 40,000 files, this takes a long time – over five hours.
There are two low hanging fruit to attack:
- Removing cheerio (and potentially Node)
- Ditching the raw html files by preprocessing all of this data upfront.
Cheerio claims that it is a blazing fast DOM. However, in my case, I didn’t need an entire DOM. I just want to linearly parse my data. Manipulating and rendering is not needed. I had a feeling that the extra bells and whistles held some overhead.
It is common in coding interviews to have data preprocessing as part of a solution to a whiteboard problem. With this being a common in big-data, I knew this to be a solution to my issue
Let’s name the algorithm for processing this data. Instead of showing an algorithm that requires optimizing, I am going to show some data that needs to be processed.
Consider some tabular data of tax payments on a property. In my case this was an HTML table:
| Status | Payment Date | Type | Amount | Receipt Number |
| Paid | 01/01/2020 | A | $1000 | 1234 |
| Paid | 02/01/2020 | A | $1000 | 1235 |
| Paid | 03/01/2020 | A | $1000 | 1236 |
The tool I chose for the job is Pythons built-in html.parser
.
It is fairly easy to use the handle_data
override to extract the above into an array.
I knew that this was going to by snappy since this underlying parser is compiled into bytecode.
[
'Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number',
'Paid','01/01/2020','A','$1000','1234',
'Paid','02/01/2020','A','$1000','1235',
'Paid','03/01/2020','A','$1000','1236'
]
Now that Python has processed my data into a flat array, we need to label this information, similar to the table above. More specifically, I want something like:
[
{
'Status': 'Paid',
'Payment Date': '01/01/2020',
'Type': 'A',
'Amount': '$1000',
'Receipt Number': '1234'
},
...
]
What algorithm is this?
Scroll down for the answer.
I felt it makes sense to use chunk
.
Since it is guaranteed that empty cells (<td></td>
) parse as empty string, we don’t need to worry about offset issues.
The result of chunk yields:
[
['Status', 'Payment Date', 'Type', 'Amount', 'Receipt Number'],
['Paid','01/01/2020','A','$1000','1234'],
['Paid','02/01/2020','A','$1000','1235'],
['Paid','03/01/2020','A','$1000','1236']
]
Now it’s pretty trivial to get this in the desired object format.
Chop off the 0th index and then access each sub-index per attribute.
You can even use the invokeMap
algorithm if you want to get fancy.
Happy coding!