OK, I’ll start by doing something you aren’t supposed to do: parse HTML using a bunch of regular expressions in Python. Don’t do this. Life is too short. Use something like BeautifulSoup. I couldn’t, because I had to use standard libraries only, so here I am.
Anyway, my program manipulated a tree of XHTML using xml.etree.ElementTree. Typical stuff, like you’d see in any tutorial on it:
import xml.etree.ElementTree as ET
...
root = ET.parse(htmlfname).getroot()
bod = root.find('body')
...
Then after a bunch of scraping, I used a regex like this to get the address in the href of the link:
anchor_addr = re.search("<a href=\"([^\"]*)\"",anchor_base).group(1)
(I’m converting HTML to Markdown. Don’t ask why.)
Anyway, this worked fine on Windows. I pushed the code, a Unix machine in the CI/CD pipeline built it, and the build failed on this line. The match would fail, so there was no .group(1) to call, because re.search had returned None instead of a match object.
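Here’s a minimal sketch of that failure mode and a defensive way around it. The function name extract_href is made up for illustration; the regex is the same one as above, just written as a raw string.

```python
import re

def extract_href(anchor_base):
    """Return the href of an <a href="..."> tag, or None if the pattern does not match.

    extract_href is a hypothetical helper, not part of the original program.
    """
    match = re.search(r'<a href="([^"]*)"', anchor_base)
    if match is None:
        # re.search returns None when nothing matches; calling .group(1)
        # on that None is what raises AttributeError in the build.
        return None
    return match.group(1)

# href is the first attribute, so the pattern matches.
print(extract_href('<a href="foo" alt="hi">link</a>'))
# href is no longer first, so the pattern finds nothing.
print(extract_href('<a alt="hi" href="foo">link</a>'))
```

Checking for None doesn’t fix the underlying problem, but it turns a crash into something the caller can handle or log.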
At first, I thought it was the typical problem with line endings, like my code was splitting strings into arrays on \n and on Windows the text had \r\n. After messing around with that, I found that wasn’t the case.
Here’s the problem:
xml.etree.ElementTree uses a dictionary to store the attributes of an element it parses. Python dictionaries used to be unordered, or at least their ordering was an implementation detail; insertion order has only been guaranteed since Python 3.7. On top of that, ElementTree itself sorted attributes by name whenever it serialized an element, and only started preserving their original order in Python 3.8. So it looks like the version of Python I was using on Windows was keeping the attributes in document order, but the older ones on my home Mac and on this Unix build machine were returning them alphabetically. So <a href="foo" alt="hi"> was becoming <a alt="hi" href="foo"> and breaking my regexp.
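To see the effect without needing two Python installations, here’s a small sketch. The two hard-coded strings stand in for what the serializer produces on newer and older interpreters; the variable names are mine, not from the original program.

```python
import re
import xml.etree.ElementTree as ET

elem = ET.fromstring('<a href="foo" alt="hi">link</a>')

# Attributes live in an ordinary dict on the element.
print(type(elem.attrib), elem.attrib)

# Serialize the element back to text. On Python 3.8+ the attributes keep
# their document order; on earlier versions they come out sorted by name.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# The strict pattern from above only matches when href is the first attribute.
STRICT = re.compile(r'<a href="([^"]*)"')

document_order = '<a href="foo" alt="hi">link</a>'  # order as written
sorted_order = '<a alt="hi" href="foo">link</a>'    # order after sorting by name

for text in (document_order, sorted_order):
    match = STRICT.search(text)
    print(text, "->", match.group(1) if match else "no match")
```

Same element, same data, and the regex succeeds or fails purely on which order the serializer happened to pick.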
My code didn’t really need to find the entire element and pull the value of the attribute, because it was already inside the element. So I was able to change that regexp to "href=\"([^\"]*)\"" and that worked, provided it was never
Long story short, don’t use regular expressions to parse HTML.
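For what it’s worth, one way to take that advice while staying inside the standard library is to ask ElementTree for the attribute instead of matching on serialized text. This is only a sketch: link_targets is a hypothetical helper, and it assumes the XHTML has no namespace prefix on its tags (a namespaced document would need the fully qualified tag name).

```python
import xml.etree.ElementTree as ET

def link_targets(root):
    """Collect the href of every <a> element under root, in document order.

    link_targets is a made-up name for illustration. Element.get() looks the
    attribute up by name, so attribute order never matters.
    """
    targets = []
    for anchor in root.iter("a"):
        href = anchor.get("href")  # None when the attribute is missing
        if href is not None:
            targets.append(href)
    return targets

doc = ET.fromstring(
    "<html><body><p>"
    '<a alt="hi" href="foo">first</a> '
    '<a name="top">no link here</a> '
    '<a href="bar" alt="yo">second</a>'
    "</p></body></html>"
)
print(link_targets(doc))
```

The anchor without an href is skipped instead of blowing up, and both attribute orders give the same result.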