A rare situation

Surprise, I didn’t forget about this project!

without comments

Way back in April I talked about a new programming project: extracting and analyzing text from the Codex Seraphinianus. It’s been several months since any updates. Progress has been really sporadic (life happens), but I can now present the next step in my silly past-time.

In my last post I mentioned that the pixel connection algorithm wasn’t perfectly suited to isolating words. One of the problems was that the scan quality produced a lot of “broken” text – cursive lines don’t quite line up and sometimes a single word is carved up into multiple regions. I’ve implemented morphological functions to “close” words and produce better connections, but I haven’t taken the time to sit down and tune the process so that it doesn’t accidentally turn fine loops into dense blocks of black. The other big problem was that the process didn’t group diacritics with the modified word, and it is the solution to this that I’d like to share.

First, let’s understand how the connected components algorithm outputs its information. It starts with an image that looks like this:

The algorithm starts at the upper-left and works down and right, numbering the regions as it encounters them (usually, more on this later):

The challenge now is to attach the diacritic, region 2, to the larger word represented by region 1. To do this, each region gets a bounding rectangle…

…and the following algorithm is applied to all N regions:

for each region X between 1 .. N:
  for each region Y between X + 1 .. N:
    if region Y is fully contained by region X:
      region X consumes region Y

In the example image, region 2 is indeed fully contained by region 1, so it becomes part of region 1. This is good for the Codex script, most diacritics appear close to the letter it modifies and usually within the boundaries of the full word. Applying this procedure to the first page of the Codex yields some pretty good results.

Click for full-size

Still not perfect, though. You can see that there are some diacritics that are clearly fully contained by the parent word, but they aren’t consumed by the word.

It turns out that there are some pathological cases in region numbering. The connected region algorithm does not always correct number regions in a top-down and left-right manner, so the enclosure algorithm listed above doesn’t catch everything. Let’s say that the example image is one of these cases:

Now region 1 doesn’t enclose region 2, and since the algorithm only counts up (for efficiency’s sake), the diacritic isn’t grouped in with the word.

It’s not a great situation, but there’s a fix. Rather than mess around with the connection algorithm and trying to figure out the numbering sequence and where it breaks down – even for a small image, the numbers start getting pretty hard to keep track of in one’s head or on paper – I added an extra step when the bounding rectangles are calculated. Working again in a top-down/left-right fashion, I simply renumber the bounding rectangles as they’re encountered. This fixes the misnumbered cases and doesn’t add complexity for the normal cases, and the result is much better:

Click for full-size

We’re still not at 100% diacritic capture – if you look closely, there are three diacritics that aren’t included with the word that they’re intended to be included with (left of the first word, right of the fourth word, right of the last word). This isn’t another pathological case or anything, those marks actually do not fall completely within the bounding rectangle of their word. Oh well. It’s time to move on to other things – frankly, if I get to the point where I’m statistically identifying language features and a couple missing diacritics make a huge difference, I can go back and tweak things then 🙂

Written by Chris

October 15th, 2009 at 9:44 am

Posted in Development,General

Leave a Reply