The one where I discover an interesting HTML parser conundrum

Against my better judgement, I decided to code a basic HTML parser. Not the full HTML spec, but enough to create a tree of nodes and attributes. I’ve already written a streamable XML parser that has been working for my podcast web app.
Parsing (most) HTML isn’t as complicated as it sounds. Look for a less-than sign < and see if a valid tag name follows. If that node is a void element or a self-closing element, it gets appended to the current parent. If it’s an opening tag, it becomes the current parent until a matching close tag is found.
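The loop described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual parser: the names `Node`, `parse`, `TAG_RE`, `ATTR_RE`, and `VOID_ELEMENTS` are all hypothetical, the regular expressions are a simplification (a `>` inside an attribute value would confuse them), and comments, doctypes, and opaque elements are ignored entirely.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Void elements never take a closing tag (a subset of the HTML spec's list).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

# A less-than sign followed by an optional slash and a tag name.
# Simplification: assumes no '>' appears inside attribute values.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*?)(/?)>")

# name, name=value, name="value", or name='value'
ATTR_RE = re.compile(
    r"""([^\s=/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?"""
)


@dataclass
class Node:
    tag: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node or str (text)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def parse(html: str) -> Node:
    root = Node("#root")
    current = root  # the node new children get appended to
    pos = 0
    for match in TAG_RE.finditer(html):
        # Anything between the previous tag and this one is text.
        text = html[pos:match.start()]
        if text:
            current.children.append(text)
        pos = match.end()

        closing, name, raw_attrs, self_closing = match.groups()
        name = name.lower()

        if closing:
            # Walk up until the matching open tag; ignore stray close tags.
            node = current
            while node is not root and node.tag != name:
                node = node.parent
            if node is not root:
                current = node.parent
            continue

        attrs = {}
        for m in ATTR_RE.finditer(raw_attrs):
            values = [g for g in m.groups()[1:] if g is not None]
            attrs[m.group(1).lower()] = values[0] if values else ""

        node = Node(name, attrs, parent=current)
        current.children.append(node)
        # Void and self-closing elements are leaves; anything else
        # becomes the current parent until its close tag shows up.
        if name not in VOID_ELEMENTS and not self_closing:
            current = node

    tail = html[pos:]
    if tail:
        current.children.append(tail)
    return root
```

Calling `parse('<div id="main"><p>Hello<br>world</p></div>')` returns a `#root` node whose only child is the `div`, which in turn holds a `p` containing the text `Hello`, a `br` node, and the text `world`.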
There are several HTML elements that I consider opaque and will skip parsing inside.