The most mysterious bug I solved at work


Background

I worked on a team that develops medical software: an e-referrals application used in Australia.

The idea is, if a GP needs to refer a patient to secondary health services, like a hospital or a specialist clinic, they have to write a referral with information about the patient, their history, and the reason for the referral. In the past, this was done by fax, which is very funny because faxing is very old, and it also led to lots of referrals being rejected for not including enough information.

E-referrals make this better for everyone. When the doctor starts a referral, we automatically extract information from the PMS (patient management software, which isn't owned by our company) about the patient's details, ethnicity, BMI, current medications, medical history, and anything else relevant for the referral. The referral form contains a "specialty form", which is a validated form with specific prompts and fields related to the service that the patient's being referred to.

This makes sure all relevant information has been included. And of course, the form contains a "referral letter", a large free text field the GP can write in to explain why they have chosen to refer the patient.

The referral data is then delivered digitally, hence the E in e-referrals. Depending on the destination, there are a few different formats it could be converted to. Some destinations use our Referral Manager product to receive referrals, in which case we don't convert it at all; they just access the same data right out of the database in the web UI. Usually, though, it gets converted to either HL7 (an old text-based medical info file format), CDA (an XML document with a linked stylesheet), or just a PDF with the information as text for humans to read. This means it's compatible with lots of different electronic receiving systems from other companies.

We also save a PDF copy of the referral back into the doctor's own PMS, just for their own record-keeping. This lets them see the complete patient history in the PMS rather than having to dig through a bunch of third party apps like ours to find related documents.

My role

When I worked there, I was somewhere between a developer and support. I did maintenance and kept the wheels turning.

Some readers might be surprised to hear that cleaning up old log files so that the servers don't run out of disk space is actually done manually at many software shops. You'd think it would be easy to automate this; you've probably even automated it yourself on your own servers. And sure, it was easy for you to do it on your own servers, which you know inside out, and where you can just do stuff. In the context of a company, it's trickier than you'd expect.

I also worked with the support team directly. If doctors encountered any bugs in the software, or problems that were too technical for the helpdesk hotline to solve, those were sent to me. I investigated screenshots and any error logs I could get my hands on, and talked to others to figure out whether this was intended behaviour. If it was a bug, I would trace it across the various servers to find a root cause location and look through the source code to determine how it got into this state. Finally, I'd write a Jira ticket with detailed info about the bug and how to fix it. Usually I wasn't the one to check in the code for the bug fix, and I was fine with that.

I am a programmer, and I did make code changes from time to time when I thought I'd be especially suited to them, but usually I left the code to the rest of the team and stayed in my own little area.

You'd be surprised by how many silly, tedious, little template-y annoying tasks I had to do to keep it all running smoothly. I loved when I got to use my intuition to figure out the causes of newly introduced bugs, but most of the time, a situation that had happened before happened again, and there'd be a wiki page written up about it with numbered instructions of what to do to fix it. There was a lot of copy-pasting referral GUIDs into SQL templates.

The most mysterious bug

One of these known repetitive bugs requiring manual intervention would occur every 2-4 weeks. A referral would be submitted successfully by the doctor and stored in our database, and everything would look OK from the doctor's end. But when dequeuing the referral and trying to send it to the destination, it would encounter an error when converting it to one of the aforementioned formats:

Illegal Character entity: expansion character (code 0x2) not a valid XML character

What the heck does that mean?

When I first started this job, what it meant was that I should open the wiki page about this issue and follow the instructions. The instructions involved running an SQL command that basically find-replaces any occurrences of \u0002 (the sequence of those 6 characters) with an empty string, thus removing the invalid character entirely. Then I would add the referral back to the queue and it would be processed successfully.
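The real fix was an SQL find-replace, but the same idea can be sketched in Python. This is only an illustration, not the script from the wiki: the function name, the `referralLetter` key, and the sample text are made up, and it assumes the form response is stored as JSON text, which would explain why the character shows up as the six-character escape `\u0002` rather than as a raw byte.

```python
import json

# Six literal characters: backslash, u, 0, 0, 0, 2.
ESCAPED_STX = "\\u0002"


def strip_escaped_stx(stored_json: str) -> str:
    """Remove every literal \\u0002 escape sequence from stored JSON text."""
    return stored_json.replace(ESCAPED_STX, "")


# Hypothetical stored form response; json.dumps escapes the raw
# control character into the six-character sequence.
stored = json.dumps({"referralLetter": "his well\x02being is healthy"})
cleaned = strip_escaped_stx(stored)

print(stored)
print(cleaned)
```

Note what this does to the text: the character is simply dropped, so whatever it stood for is lost as well.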

As my understanding of the systems I worked with broadened, I only found more and more questions:

What the heck is character 0x2?
Where is it in the data?
Why is it here? How did it get here? What created it?
Was it supposed to be here? Is it bad that the fix is to remove it?
Why does it cause problems?
Why does it happen (as often/as rarely/as regularly as) every 2-4 weeks?

For several months I was content enough with following the process to fix it for each broken referral, one at a time, but each time I did it, the questions gnawed at me more and more.

Whenever the mystery became too much to bear, I would spend some time looking further into the system, and each attempt only added more, weirder questions to my list.

Bad data?

Maybe this weird character came from extracting bad data from the PMS. This isn't a rare occurrence; every so often I'd diagnose a bug in production that was caused by the PMS containing invalid data that you're not normally able to enter, for example a medication with no name, or a BMI calculation with no weight. You can't type in this data fresh, so we were never able to create or replicate these issues in our test environment, and I really don't know how this had happened in production PMS systems. Maybe they updated from an older version where the validation was less thorough? Anyway, we'd dutifully take this data from the PMS, valid or not, and put it in our form. Then, when the doctor submits the referral, it tries to submit bad data, which ends up failing our validations, making the referral undeliverable in the backend. Maybe loading bad data could have been giving us this 0x2 character?

I took a good look at the SQL script that would remove \u0002 to find out which field it was editing. This would tell me where to look. I found that it would remove any occurrences within the specialty form response. That means any specialty form fields, including ones automatically pulled from the PMS, might be ending up with this character. This was a good lead.

I opened up the data of a referral to see which specialty form field it had appeared in, and I was more confused than ever. It was in the referral letter field. This is the free-text field that the doctor types into; it's never autofilled from the PMS. The 0x2 character is in here because the doctor has typed it in! What! How!

What even is this thing?

I should probably explain what character 0x2 is, so you can understand why it's so incredibly improbable for the doctor to have typed it.

Apologies if you already know some of this. The explanation gets more interesting as you read on. My mum reads this blog, so the first paragraph is necessary :)

Computers store everything as bytes. Each letter of the text you're reading is represented by a number in the computer. For example, "A" is 65, and the numbers go all the way up to 127. That gives plenty of room to store all the letters, symbols, and numbers of the English language, with some free slots left over. The numbers 31 and below aren't needed for any characters you can see. When designing this system, some boffins decided to put "control characters" in those slots, 0 to 31. These are invisible characters that don't appear to the person reading the text. Instead, they tell the computer how to lay out the text that comes next.

This was pretty important back in the days of teletypewriters. If it was writing some text across the paper, it wouldn't know when to move down to the next line without the control characters in the text stream. Control character 0xD tells it to move the print head back to the left side of the page, and then 0xA tells it to feed the paper through by one line so that it can start writing on the next line. There were even things like 0x7 which would ring a bell, but nowadays if you try this, your computer will probably just make a sad little beep noise.

Most of these aren't really needed nowadays because computers have much more advanced ways of doing layout and formatting. Apart from the occasional tab (0x9), the only one you'll really see is 0xA to start a new line of text, but even the meaning of this has changed, because computers are smart enough to know when the text is getting too long and can reflow it to the next line without an explicit command. You can still press Enter to insert a 0xA to force a new line though.
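If you'd like to see these numbers for yourself, here is a small Python sketch. The helper name `control_chars` is just something I made up for illustration; it lists every character in a string whose code is below 32.

```python
def control_chars(text: str) -> list[tuple[int, int]]:
    """Return (position, code) for every character with a code below 32."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) < 32]


print(ord("A"))                                 # 65
print(hex(ord("\r")), hex(ord("\n")), hex(ord("\a")))  # 0xd 0xa 0x7
print(control_chars("line one\r\nline two"))    # [(8, 13), (9, 10)]
```

The last line shows the teletypewriter pair in action: position 8 holds 0xD (carriage return) and position 9 holds 0xA (line feed).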

So what's 0x2? It's called "start of text", and there's a corresponding 0x3 too, "end of text". Together, these could be used to delimit certain sections of text, to mark them as special or something. This is hard to explain without an example, but I can't actually find any real-world uses for these, because like I said we have more advanced ways of indicating this now, so 0x2 just isn't really used. Medical technology does tend to be really old, so it's possible there's some computer system that still encodes data this way?

If it were used like that, you'd expect to see a corresponding 0x3 in the referral letter, to mark the "end of text". But there isn't one. The referral letter only has 0x2. How strange.

To summarise, character 0x2 is "start of text", but that character doesn't convey semantic meaning, it isn't being used by any systems I know of, and it isn't even being used to delimit a section of text here.
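As for why it causes problems: XML 1.0 only allows three characters below 0x20 (tab, 0xA and 0xD), so a document containing 0x2 isn't valid XML at all, whether the character appears raw or as a `&#2;` reference. The sketch below demonstrates this with Python's built-in parser as a stand-in. It is not the library our conversion code used, and the `<letter>` element and `is_well_formed` helper are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<letter>well-being</letter>"))          # True
print(is_well_formed("<letter>line one\nline two</letter>"))  # True
print(is_well_formed("<letter>well\x02being</letter>"))       # False
print(is_well_formed("<letter>well&#2;being</letter>"))       # False
```

Newlines pass because 0xA is on the allowed list; 0x2 is rejected in both forms.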

...You can see that whenever I tried looking into this I would end up with more, weirder, questions.

Why every few weeks?

Desperate to find any pattern to go off, I started looking at the summary of all the referrals that had had this problem. I noticed that these referrals tended to be submitted by a few of the same doctors, and for their same patients. This would definitely point to it being bad data in the PMS's patient file. But I'd already ruled that out, because the character is in the referral letter.

I wondered if there was anything else I could get from this summary. Maybe certain doctors have certain software or certain workflows that make this more likely to happen, and that's why it hasn't happened to anyone else.

I decided to actually try reading the text of a referral letter with problematic data, and immediately I saw something strange: the text is hard-wrapped.

Normally text will reflow by itself to fill the available space, moving to the next line automatically. But in this referral letter, each line is actually marked by a newline character, like the doctor was typing in the text box, got close to the right edge, and then pressed Enter to move to the next line.

Why would they do this? I know doctors in particular can struggle with computers, maybe some of them think of it like a typewriter where you have to control the lines manually, but surely they've noticed, even by accident, that if you write too much it will move to the next line?

How could this text possibly end up hard-wrapped?

...Everything I did just gave me more questions.

What do the hard-wraps mean?

The idea that multiple computer-illiterate doctors would all be doing this seemed a little too far-fetched, but maybe? I opened up another referral letter from the same doctor and read that too, expecting to see the same patterns of Enter key presses, which could help confirm my theory.

I saw the exact same referral letter.

Same text. Same hard-wraps. Same invalid character that had to be replaced.

Just to make sure, I picked a random valid referral letter, and the text didn't have any newlines, which is what I would have expected. Could the hard-wraps somehow be correlated with this 0x2 character?

I took a closer look at the hard-wraps, and noticed something like this:

I'm referring Liam here because of consistent history of slow speech
development. Physically, his well♦being is healthy, so likely needs
special tutoring, could be related to psychological issues.

(This is a sample that I made up. It's not copied or recalled from any real referral.)

Every line had been hard-wrapped except for the line containing the 0x2 character (which here I've indicated with a diamond). The 0x2 character had appeared where I would have expected a newline to be. It was almost like the newline character had been corrupted and replaced with something invalid. But how could that happen, and how could it happen consistently to all referral letters for the patient, and without affecting the rest of the text?

I figured the hard-wraps were my best lead, so I focused on trying to work out where they had come from. Maybe the doctor had decided to write the letter in a Word document first, and then pasted it into the form, and that introduced hard-wraps? ...No, since the doctor sent multiple identical letters, they probably copy-pasted the text from somewhere else, somewhere persistent. Maybe a PDF file that's part of the patient record? PDF files tend to be pretty "baked in", if you know what I mean, so it would make sense if copying from a PDF introduced newline characters where the lines ended. I tried composing a PDF in Adobe Acrobat, saved it, and copied the text to Notepad. I was right! It had hard-wrapped it, baking the line break locations into the text.

What kind of PDF could the doctor have copied from? Since the letter was written by the doctor in first-person, talking about making a referral, it would only make sense if it was originally copied from a referral letter. The doctor had created a referral, typed in their letter normally, sent it, the PDF writeback copy of the referral was saved into their PMS, and then they opened the PDF, copied the text, and pasted it into a new referral letter!!

Everything was falling into place. I felt an incredible thrill rush through me, like I was seeing through the Matrix directly into this doctor's mind and I was watching through their eyes at exactly what they had done, step by step.

But what does this have to do with the character?

After studying a handful of referral letters from different doctors, I noticed something else. The places where 0x2 appeared instead of a newline seemed to be in sentences where the line would have ended with a hyphen. For the example above, "well-being" is spelled with a hyphen. If that word appeared at the end of the line, the computer would have wrapped it like this:

development. Physically, his well-
being is healthy, so likely needs

The hyphen had been transformed into 0x2. So if a hyphen appears at the end of a line of a PDF, and gets wrapped, it gets corrupted somehow? Is this a problem in the software we're using to build our writeback PDFs, where it doesn't generate the PDF properly in this niche case?

I tried to retrace the doctor's steps. I opened up my test copy of referrals, wrote a referral where the referral letter was just a bunch of hyphenated words, and prayed that one of them would happen to be split across a line after the referral was converted to PDF. It took me a few tries, but I finally managed to get a PDF that matched this. Looking at it in the PMS's writeback area, though, it looked normal; the hyphens rendered completely normally.

I started a new referral and copied and pasted the text from the PDF into the new referral letter. And there it was. One of the lines didn't get hard-wrapped. There was a little faint box symbol in the middle of the line. And when I sent the new referral and checked the logs, there was an error:

Illegal Character entity: expansion character (code 0x2) not a valid XML character.

But why did it do that!

Was our PDF writing library somehow so broken that it made a PDF where the character appeared normal but got copied weird? I guess it's possible. I've seen PDFs of books with OCR'd text where I highlight and copy something and it copies the incorrect OCR output instead.

I don't really know how to debug PDFs, so I did the first thing that came to mind. I opened the glitchy PDF in Firefox and used devtools to investigate each line of text. On the line with the hyphen, I studied the characters at the end, but they all seemed normal, both visually and in devtools. Hm. Not what I expected. Just to double check, I tried copying the text out of the main PDF again, and this time I got a real hyphen. Copying the PDF didn't give me 0x2 the second time. How? It's the same PDF!

I retraced my steps and copied the PDF from the PMS writeback again. It put 0x2 on the clipboard.

So the PDF itself is fine. The character is being introduced by different PDF viewers?

I tried copying this text in all the PDF viewers I have access to: Firefox, Chrome, Edge, and Adobe Acrobat. (I checked Opera too, but they just use the Chrome one, of course. It's surprising that Edge has a different one, since it's literally Chrome.) Across those four different PDF viewers, I saw four different behaviours when copying the hyphen:

It copies a line break and a hyphen (great)
It copies just a hyphen (sure)
It copies nothing (uh)
It copies 0x2 (what!!!!!)

When copying a hyphen at the end of a text-wrapped line in Microsoft Edge's PDF viewer, you get 0x2 on your clipboard. This is the default program to open PDF files on most Windows systems. The PMS uses Edge's PDF viewer.
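To make the behaviour concrete, here is a toy model of it in Python. It only mimics what I observed from the outside; it has nothing to do with how Edge's PDF viewer is actually implemented, and the function name and sample lines are made up.

```python
STX = "\x02"  # control character 0x2, "start of text"


def simulate_copy(visual_lines: list[str]) -> str:
    """Model the observed copy result for the visual lines of a PDF page.

    Normal lines are joined with a newline (the hard-wrap). A line ending
    in a hyphen instead has the hyphen swapped for 0x2, with no newline.
    """
    parts = []
    for index, line in enumerate(visual_lines):
        is_last = index == len(visual_lines) - 1
        if line.endswith("-") and not is_last:
            parts.append(line[:-1] + STX)
        elif is_last:
            parts.append(line)
        else:
            parts.append(line + "\n")
    return "".join(parts)


page = [
    "history of slow speech",
    "development. Physically, his well-",
    "being is healthy, so likely needs",
    "special tutoring.",
]
print(repr(simulate_copy(page)))
```

The printed result has newlines after the first and third lines, and a `\x02` sitting between "well" and "being" on a line that never got its line break.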

After that, I put my findings into a Jira ticket and sent it to the dev team. Now that we knew that 0x2 was meaningless, and that we couldn't control whether it appeared in the text field, their proposed solution was to automatically delete 0x2 if it appeared anywhere. I insisted that they alter the code to replace 0x2 with a hyphen.
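The difference between the two fixes is easiest to see side by side. This is a minimal sketch with made-up function names, not the code the team shipped:

```python
STX = "\x02"  # control character 0x2


def delete_stx(text: str) -> str:
    """The first proposal: drop the character entirely."""
    return text.replace(STX, "")


def restore_hyphens(text: str) -> str:
    """The fix I asked for: put back the hyphen that 0x2 stands in for."""
    return text.replace(STX, "-")


pasted = "Physically, his well\x02being is healthy"
print(delete_stx(pasted))       # Physically, his wellbeing is healthy
print(restore_hyphens(pasted))  # Physically, his well-being is healthy
```

Deleting makes the error go away, but it silently changes what the doctor wrote; replacing gives back the original text.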

Eventually, the new version was released, and the problem didn't occur again. I had solved the mystery. This was the stupidest rabbit hole I'd ever gone down, and the most interesting bug I'd ever solved at work.

I am currently unemployed. I chose to quit my job due to mental health complications and I am living off savings. If you have any interesting opportunities for a talented programmer, debugger, and computer generalist, please get in touch! I live in Aotearoa New Zealand.
