A rare situation

Oh boy, another programming project!


Since shaim is over, I’ve found myself with rather a glut of time on my hands. I’ve been reading a lot more (alternating between the Dune series and the Dresden Files series) but I still have the desire to work on programming stuff in my spare time. This time around I’ve decided to combine my love of strange books with my interests in linguistics and computational analysis.

I’m guessing that there aren’t many people reading this who know what the Codex Seraphinianus is. From Wikipedia:

The Codex Seraphinianus is a book written and illustrated by the Italian artist, architect and industrial designer Luigi Serafini during thirty months, from 1976 to 1978. The book is approximately 360 pages long (depending on edition), and appears to be a visual encyclopedia of an unknown world, written in one of its languages, a thus-far undeciphered alphabetic writing.

It’s rather rare (read: expensive) and isn’t currently published in the US, but I happened to get a hold of a copy via Courtney’s connection to the library at the University of Illinois. The art is beautiful, and the language intrigues me. Linguists have worked at figuring out this language since the book was first published – for all anyone knows, it’s complete nonsense. I’m not vain or stupid enough to think that I can succeed where trained linguists have not, but I decided I’d try my hand at learning about image processing and give the deciphering a crack myself. What can it hurt?

The first step to all of this ballyhoo is to extract words from the book. Recently I came across some fairly high-quality scans of the Codex. Here’s an example of the writing:
Writing from the Codex Seraphinianus

The scans are full-color and you can see where images on the opposing page have shown through. The first thing I did was write a quick program to convert the images to black-and-white. This removes a lot of the noise, smallerizes the image files, and gives analysis algorithms an easier time of figuring out what’s where.

The same paragraph in 2-color format

The next step is to figure out where words are on the page. I’ve implemented a connected components algorithm that identifies connected regions in an image.
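As a rough illustration of the idea (not the code used here), connected components labeling can be sketched as a flood fill over the binary image. This version assumes 8-connectivity, meaning diagonal neighbours count as touching, and a 0/1 list-of-rows image; `label_components` is a hypothetical name.

```python
from collections import deque


def label_components(image):
    """Label connected regions of ink (value 1) in a 0/1 image.

    Returns (labels, count): labels has the same shape as image, with 0 for
    background and 1..count for each region. Assumes 8-connectivity.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or labels[y][x] != 0:
                continue  # background, or already part of a region
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            # Breadth-first flood fill from this seed pixel.
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, count = label_components(image)
```

Each distinct label can then be drawn in its own colour to check the result by eye.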

The same paragraph with connected regions colored

That’s all my progress as of last night. The algorithm isn’t perfect yet; I’ll have to tweak it to find regions that should be connected but aren’t because of quality issues in the black-and-white conversion. You can also see that diacritic marks aren’t grouped with the word that they belong to, so the next phase of processing will involve grouping regions that fall within a bounding area.
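One possible way to do that grouping, sketched under some assumptions: represent each region by its bounding box `(left, top, right, bottom)` and merge boxes whose gaps are within a small margin. The separate `margin_x` and `margin_y` parameters are hypothetical, chosen so that a diacritic sitting just above a word can be pulled in without also merging neighbouring words.

```python
def boxes_are_close(a, b, margin_x, margin_y):
    """True if boxes a and b overlap or lie within the given margins.

    Boxes are (left, top, right, bottom) in pixel coordinates.
    """
    horizontal = a[0] - b[2] <= margin_x and b[0] - a[2] <= margin_x
    vertical = a[1] - b[3] <= margin_y and b[1] - a[3] <= margin_y
    return horizontal and vertical


def group_boxes(boxes, margin_x=2, margin_y=4):
    """Merge nearby bounding boxes; returns the merged boxes, sorted."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_are_close(boxes[i], boxes[j], margin_x, margin_y):
                parent[find(i)] = find(j)

    merged = {}
    for i, (left, top, right, bottom) in enumerate(boxes):
        root = find(i)
        if root in merged:
            l, t, r, b = merged[root]
            merged[root] = (min(l, left), min(t, top),
                            max(r, right), max(b, bottom))
        else:
            merged[root] = (left, top, right, bottom)
    return sorted(merged.values())


# A word, a diacritic just above it, and a separate word to the right.
boxes = [(0, 10, 40, 20), (15, 5, 18, 7), (50, 10, 80, 20)]
grouped = group_boxes(boxes)
```

The margins would need tuning against the real scans; too generous a vertical margin would start merging adjacent lines of text.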

I’m not entirely sure what comes next, at least as far as extracting the text goes. This is a learning experience for me. Further bulletins as events warrant 🙂

Written by Chris

April 16th, 2009 at 9:10 am

Posted in Development,General
