Zoe

Submitted by reeses on Thu, 2002-10-10 05:04.

Goose writes blog entries of disturbing length. Disturbing, because I can never muster the effort to type that much, usually giving up and deleting the last paragraph or two. If I'm lucky, I used my rusty "inverted pyramid" skills from Jr High, so the post can stop at any point and still make sense when I reread it three months later.

I made another effort at digging through the Zoe code, but it's slow going. I have a mountain of coding to do at work, yet somehow, teasing the yarn from this sweater is what I find myself doing. I'm really wondering what IDE was used to build this -- I have no idea how it compiled. There's a class that implements two of its own inner interfaces, and it appears to have sent IDEA into a frozen loop. Tough stuff!

So, let's sketch out the pieces that I would need to steal or build to implement equivalent technology.

  • Parse email messages into structured headers and body.
  • Pick out keywords, discarding common terms. Match using proximal Levenshtein distance? Phrases?
  • Tag keywords with hyperlinks.
  • Render HTML version of email messages.
  • Retrieve appropriate emails based on a click action.
  • Summarise groups of emails (by any clickable field).
  • Extract structured strings like URLs, email addresses, etc.
  • Search message store.
  • Authentication for web browsing.
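
As a rough sketch of the first and seventh items, assuming plain RFC 822-style text in and nothing fancier than regular expressions. The helper names `parse_message` and `extract_links` are made up for illustration, the patterns are deliberately loose, and MIME parts and encoded headers are ignored entirely.

```ruby
# Hypothetical helpers for illustration; none of this is taken from Zoe.

# Loose patterns (assumptions): good enough for a sketch, not for production.
URL_PATTERN   = %r{https?://[^\s<>"]+}
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

# Split a raw message into a hash of lower-cased header names and a body string.
def parse_message(raw)
  # Headers end at the first blank line; everything after it is the body.
  head, body = raw.split(/\r?\n\r?\n/, 2)
  headers = {}
  # Unfold continuation lines (those starting with whitespace) before parsing.
  head.to_s.gsub(/\r?\n[ \t]+/, " ").each_line do |line|
    name, value = line.chomp.split(":", 2)
    next if value.nil? # skip anything that is not "Name: value"
    headers[name.strip.downcase] = value.strip
  end
  { headers: headers, body: body.to_s }
end

# Pull URLs and email addresses out of a block of text, without duplicates.
def extract_links(text)
  {
    urls:   text.scan(URL_PATTERN).uniq,
    emails: text.scan(EMAIL_PATTERN).uniq
  }
end
```

The URL pattern will happily swallow trailing punctuation, such as a full stop at the end of a sentence, so a real version would need to trim that. Repeated headers (like `Received`) also overwrite each other here, which is fine for `From` and `Subject` but not for much else.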

I don't think I need a POP/IMAP client or server, because I could feed the engine with procmail, taking advantage of bogofilter and SpamAssassin. I am perfectly happy having two message stores -- the IMAP structure I use with mutt and Outlook Express, and the searchable store I browse with a web browser.

I'm debating having a folder hierarchy automatically generated, perhaps similar to that employed by Endeca. It's tempting, but I'm very particular about data grouping. I'm sure it would take me a long time to get happy with it. Did I mention I'm fatally lazy?

Anyway, I could probably spin up the kernel of this in Ruby over a weekend. I'd probably be able to do the message parsing, indexing by email address, and storage into a database right away. A web interface for simple search and rendering of lists would take a day or so. The next bits would be the fun bits -- pulling out keywords while ignoring chaff, and building indexes for those keywords. A word that appears in only one message shouldn't be tagged, but if three or more emails contain that word, a link axis should be created on that keyword through those messages. I need to come up with a way to ignore common words, though -- I don't want every bloody word tagged. Articles are easy, names are harder. I'd have to add training to the requirements list -- type a word or phrase into a search bar, and that phrase would become a new axis between the existing documents that contain it. How long would it take to rebuild this index?
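
The three-message rule and the training step could be sketched like this. The stop list is a tiny stand-in, the tokenizer is a crude regular expression, and the names `build_axes` and `phrase_axis` are hypothetical; nothing here comes from Zoe itself.

```ruby
require "set"

# Tiny illustrative stop list (assumption); a real one would be far longer.
STOP_WORDS = Set.new(%w[a an and the of to in is it for on that this with])

# "Three or more emails" is the threshold named in the post.
MIN_MESSAGES = 3

# messages: hash of message id => body text.
# Returns keyword => sorted message ids, keeping only keywords that
# appear in at least min_messages distinct messages.
def build_axes(messages, min_messages = MIN_MESSAGES)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  messages.each do |id, body|
    # Crude tokenizer: lower-case runs of two or more letters.
    body.downcase.scan(/[a-z][a-z'-]+/).each do |word|
      index[word] << id unless STOP_WORDS.include?(word)
    end
  end
  index.select { |_word, ids| ids.size >= min_messages }
       .transform_values { |ids| ids.sort }
end

# Training step: a user-supplied phrase becomes an axis through every
# message whose body contains it (case-insensitive substring match).
def phrase_axis(messages, phrase)
  needle = phrase.downcase
  messages.select { |_id, body| body.downcase.include?(needle) }.keys.sort
end

messages = {
  1 => "The Zoe indexer is slow",
  2 => "Zoe crashed again",
  3 => "Is Zoe worth the effort?",
  4 => "Lunch on Thursday"
}
axes    = build_axes(messages)                  # only "zoe" reaches three messages
trained = phrase_axis(messages, "zoe crashed")  # matches message 2 only
```

In this naive form, both the full build and each trained phrase cost a complete scan of every stored body, so rebuild time grows with the size of the store. Names still slip through, since nothing here distinguishes "Goose" from any other uncommon word.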

It's a good thing I'm both lazy and swamped by work that pays me. I won't have to worry about this stuff. I'll just wait and watch Freshmeat until someone does Zoe right. :-)
