Untitled

Submitted by reeses on Fri, 2002-10-04 03:58.

On our projects, we typically have a few developers working on various portions of online stores, usually delineated by function. That is, we'll have a couple people working on interfaces to outside systems, one or two working on checkout, one working on catalog, one working on registration/accounts, etc. The front end people pound out JSPs, and in the last week or so before QA, we bring it all together and see how it integrates. We don't do pair programming, and we don't synchronise low-level design at the HTML level. So, there's a lot of redundancy.

I've been thinking about writing a program to do tree edit distance analysis of HTML documents to factor all of these pages into parameterised templates. We do that already for larger components -- we have a couple base layout templates, and each developer gets a region of screen real estate for their included content. That works pretty well, but having spent too many years doing OO (and being fantastically lazy besides), I don't like the idea of a thing existing in two places at the same time.

On the subway back to the apartment, I noodled over it a little bit. HTML lends itself rather well to tree modelling, as that's exactly what the documents are. Calculating subtrees, surtrees, and the common factors between trees alone is interesting, and would actually be easy to implement by flattening the trees and doing list comparisons. Taking it a step further, calculating the edit distance for labeled nodes (or tags, in this case) would enable us to find common structure, and would enable us to fuzzy match. If a developer tended to wrap text in bold tags, while another did not, but they both displayed data in tables of similar structure, the edit distance between subtrees would be small. Create a template that takes the text in, and boom -- less HTML, shorter files, more happiness.

I have the feeling this would be dog slow, however. Would there be a way to calculate a tree checksum that is a deterministic diminutive representation of the tree, for quick comparison with other trees? FFT?

Whenever I get one of these ideas, I have the nagging feeling that someone should have written it already, and I peruse freshmeat, google, citeseer, etc., and look for an implementation. Remember, I'm lazy. I've found a couple papers that will be helpful, and I have the nagging suspicion I could build something really easily with xpath, but no implementations yet. I'll get bored after the project ends and whip something up in Ruby one night.

Wish me luck. If you've seen anything, please let me know.

Post new comment

Captcha Image: you will need to recognize the text in it.
Please type in the letters/numbers that are shown in the image above.