Do a Maggie

Monday, 28 July 2025

The last post literally drove me crazy. Not because of the length, although writing a text of over three thousand words and twenty thousand characters in two different languages is no small feat.

The real problem started when, at some point, the Markdown file of the Italian version of the post got corrupted. Whenever Hugo tried to convert it to HTML, the generated file showed the dreaded replacement character � instead of Italian accented letters. This is that black diamond with a white question mark inside that we have seen in tons of emails and web pages.

It took me hours to fix the issue.

When a Markdown file gets corrupted

My first guess was that the file contained some rogue character, such as a space that wasn’t really a space, or one of those weird letters that caused quite a stir a few years ago. Hunting them down one by one was a pain, so I did the opposite: I wrote a regular expression that matched everything that wasn’t normal, and gradually refined it as I went.

In the end, the regular expression turned into this.

[^a-zA-Z0-9àèéìòùÀÈÉÌÒÙ ,.;:`'"()\[\]\-^/_*#\n]

It was a very down-to-earth regex, but it worked, and that was what mattered. Unfortunately, it didn’t find anything strange.
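For anyone who wants to run the same kind of check outside an editor, here is a minimal Python sketch. Only the pattern comes from this post; the function name find_unexpected and the format of the report are illustrative choices, and the sketch assumes the text has already been read into a Python string.

```python
import re

# The character class from the post: anything NOT in this set is reported.
UNEXPECTED = re.compile(r"""[^a-zA-Z0-9àèéìòùÀÈÉÌÒÙ ,.;:`'"()\[\]\-^/_*#\n]""")


def find_unexpected(text):
    """Return (line, column, code point) for every character outside the set.

    Lines and columns are 1-based. The name and return shape are
    hypothetical, chosen only for this example.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in UNEXPECTED.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits
```

Note that a check like this works on decoded text, so it can only flag characters that are unexpected, not bytes that are stored in the wrong encoding.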

To narrow down the part of the text that might contain the error, I started removing chunks – first a few lines, then entire sections – rebuilding the trimmed file with Hugo each time. Nothing. Those damn black diamonds just wouldn’t go away!

And that’s when I did the one thing that really made a difference: after trying over and over with no success, I gave up and went to have dinner.

Yes, because when you’re stuck on a problem — whether it’s a tricky math exercise, a program that refuses to run, or the opening of a post that’s been in your head for hours — the best thing to do is give up and do something else. I don’t know about you, but it always works great for me.

And sure enough, after dinner the idea hit me: I did a Maggie, and everything was back to normal.

– Image generated by Google Gemini.

Do a Maggie

Maggie is the nickname of Margaret Secara, who became known for a simple but very effective trick to recover a corrupted Word document: click the ¶ icon in Word’s Ribbon to show formatting characters (if the Word window is narrow, the icon could be hidden inside the paragraph section), then select all the text except the final Enter character (as shown in the image below), copy it, and paste it into a new blank document.

If this does not work, you can repeat the process by copying consecutive sections of the text into a new document, in order to isolate the corrupted portion as much as possible.

The first time I used this trick, I was working on a document of over a hundred pages, which had been passed around several colleagues and multiple versions of Word, and had become unmanageable: each new character took seconds to appear on the screen and Word kept crashing. With the trick described above, I fixed the problem in no time.¹

Maggie’s trick works because Word files have a dual structure. The first is what is shown on the computer screen, consisting of text, images, and tables, divided into chapters, sections, and paragraphs, with bold, italics, underlining, page breaks, lists, and indents, all of which the user places somewhat randomly.

The second structure is that of the underlying XML file, which, after a few additions, edits, deletions, and rethinking, becomes a messy tangle of nested XML tags. These, however, remain completely invisible to the user and are never properly cleaned up and reorganized by the program.

Unless we do it manually, copying all the text into a new document, but avoiding like the plague the last Enter character which, for some arcane reason, contains the access key to the hidden nasties within the Word document.

Margaret “Maggie” Secara may have discovered this trick by pure accident, but it turned out to be so useful that her nickname became a real word, like Kleenex, Band-Aid, Post-it, Tupperware, or Google.

But they are text documents!

However, I write my posts in Markdown, not Word. Even so, starting with a fresh document and copying the contents of the corrupted one into it can be the fastest way to solve many problems.

Of course, this trick works with any type of plain text document, whether it is written in Markdown, LaTeX, HTML, XML, JSON, Org mode… or any other format you can think of, no matter how obscure.

In fact, since a text document hides nothing from the user, unlike Word, there is no need to avoid copying the last Enter character. You can safely select all the text, copy it and paste it into a new, empty file.

Before doing so, it is always a good idea to check that the text does not contain any spurious characters, using a regular expression similar to the one shown above.

And with that, the post is complete. The two sections below are intended for the handful of readers who are curious to know what really happened to me.

Only for the curious…

Why does the trick work?

Once the immediate problem was solved, curiosity got the better of me and I wondered why the method described works with text documents. As is often the case, working from the Terminal helps solve the most difficult problems.

In a regular editor such as BBEdit or TextMate, the corrupted file looks perfectly normal. But if you view it in the Terminal using the cat command, you will immediately notice that all the accented characters are replaced by ?, which is more or less what happens when converting it to HTML.

The file command gives the definitive answer. When I run it on the corrupted file, with file 2025-07-22-macos-tahoe-developer-beta-3.md, I get the following response:

Non-ISO extended-ASCII text, with very long lines (1264)

while with any other Markdown file, the result is (as it should be)

Unicode text, UTF-8 text, with very long lines (1264)

In other words, the file had lost the correct UTF-8 encoding, becoming a non-standard plain text file.
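The same diagnosis can be approximated in Python by trying to decode the raw bytes as UTF-8. This is only a sketch: first_invalid_utf8_offset is a hypothetical helper name, and it reports far less than file does, namely just the position of the first byte that breaks UTF-8 decoding.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None.

    Hypothetical helper for illustration; it relies only on the standard
    behaviour of bytes.decode, which raises UnicodeDecodeError with a
    .start attribute pointing at the offending byte.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Possible usage (the file name is the one mentioned in the post):
#   from pathlib import Path
#   raw = Path("2025-07-22-macos-tahoe-developer-beta-3.md").read_bytes()
#   print(first_invalid_utf8_offset(raw))
```

A result of None means the bytes decode cleanly as UTF-8; any number points at the place where the encoding first goes wrong.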

Had I noticed this earlier, I could have used iconv to correct the file encoding. However, there is no doubt that the copy-and-paste method is much more practical.
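For reference, the kind of conversion iconv performs can also be sketched in Python. The big assumption is the source encoding: I never identified which encoding the file ended up in, so the sketch takes it as a required argument rather than guessing. The name to_utf8 is hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Return data re-encoded as UTF-8.

    If the bytes are already valid UTF-8 they are returned unchanged.
    Otherwise they are decoded with source_encoding, which the caller
    must supply: it is an assumption, not something the file reveals.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        text = data.decode(source_encoding)
        return text.encode("utf-8")


# Possible usage; "mac_roman" is only a guess for a file written on a Mac:
#   fixed = to_utf8(raw_bytes, "mac_roman")
```

If the accented letters still look wrong after the conversion, the guessed source encoding was not the right one and another candidate should be tried on a copy of the file.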

What really happened?

It’s hard to tell, but there are some clues. Since the post was very long, I wrote it in several stages, partly on my Mac Mini and partly on my Air.

I usually copy my work files locally and synchronize them later. But this time I was constantly switching computers, so I preferred to work directly on the synchronized folder (using Syncthing, but I don’t think it was its fault; otherwise I would have found conflicting files, which I didn’t). Instead, there were occasional network glitches, and I guess that these ended up corrupting the file encoding.

What’s the moral of the story? Work locally and synchronize only when you’re done. Or work directly on the network only when the connection is stable.

But when you can’t do that, remember good old Maggie’s advice.

¹ It seems that it might be possible to recover a Word document without losing track of the changes made. However, the guide uses keyboard shortcuts typical of Word for Windows, so it is not immediately applicable to the macOS version.
