I'm scratching my head over the "standardisation" process for document file formats, in particular text and form documents. Open Document Format and Office Open XML pitted against each other at Ecma, much time and resources spent.
Why bother at all?
A form or a document consists of two things - content and layout. Nay, many pieces of content and then layout. Information mashed up into one, a text document or a text form. An information storage and distribution methodology more or less unchanged since the days of the Pharaohs. Papyrus scrolls, Gutenberg prints or OOXML formats - same stuff, new wrapping.
Quite practical in pre-IT days, but it has a drawback: the moment you manipulate data it loses value and becomes harder to find, link and reuse. Storing data in manipulated form is, simply put, not smart. Nope, it's plainly stupid if alternatives exist.
A letter contains finite information objects, each easily stored using some good old IT standard, say ASCII.
It has a body text, something that could easily be split in finite objects like Introduction, Argument, Closing and whatever you'd like - just name the objects and define the relationships.
And then it has an Address, certainly a finite object if there ever was one. How many "documents" have your address on them? Hundreds? Thousands? How many misspellings would there be within that mass? Tens? Hundreds? How many letters do you have to send out when you move? Hundreds? How many letters have never reached you?
Why not have one single "address" object representing that house then linked to any other information object pertaining to the house or you as long as you live in that physical object? One single link to update, no room for errors, no mess.
For textual information I am not interested in stored layout; layout I can apply when I need it, thank you. I'm interested in pure information. And most probably in a single data object therein, and of course in how and what it links to.
A "Patient Journal" is a more elaborate "text form", but it consists of finite objects that arise on separate occasions - an inoculation, a surgery, a round of medication, a broken arm. And truth be told, the health industry would be better off with finite objects than with the form - that way researchers can be allowed to see whatever they should see, but no more without signing an NDA.
If I want to send a letter, if I need a report to be distributed, if I need a patient journal to be studied in its entirety, I simply apply some logic to the finite objects: start with this object, then add that, then those - and print it out with nice margins and the font of the day.
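As a rough illustration of "apply some logic to the finite objects", here is a minimal Python sketch. The object names, the sample texts and the `assemble` helper are all made up for the example; the only point is that the stored objects stay raw and the layout (here just a left margin) is applied at output time.

```python
# Hypothetical raw objects: plain text only, no layout stored with them.
objects = {
    "address": "1 Example Street, Sampletown",
    "introduction": "Dear neighbour,",
    "argument": "The hedge between our gardens needs trimming.",
    "closing": "Kind regards",
}


def assemble(order, store, margin=4):
    """Pick objects in the given order and add a left margin on output only."""
    pad = " " * margin
    return "\n".join(pad + store[name] for name in order)


# "Start with this object, then add that, then those."
letter = assemble(["address", "introduction", "argument", "closing"], objects)
print(letter)
```

A different order or a different margin gives a different "document" from the same raw objects, and nothing manipulated is ever saved.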
Keep data in raw form, apply logic when needed, never save manipulated data. Full stop.
Stop this "text document" and "text form" thinking now, time to move onward from the papyrus-scroll-concept and stop bothering with standards for manipulated data. That is bad form and quite foolish.
Great post, Sig. It's amazing to see how hard it is to break from traditional thinking when things have been done the same for thousands of years.
-ewH
Posted by: ewH | April 12, 2007 at 01:06
Surely it is necessary to tag data in some way to ensure that you can draw the correct information, collate it into a document, and format?
Maybe (probably) my understanding is wrong, but aren't these formats just about tagging information in a "standard" way? TeX is probably the best example of tagging raw data and then converting into a document.
Although your underlying philosophy (objects) may be different to the "format a document" philosophy, how divergent are the two in reality?
Posted by: Duncan Drennan | April 12, 2007 at 08:00
Thanks Eddie,
maybe not so surprising given that we've had IT-based text document handling for, say, about 30(?) years out of 2-3,000 total. Still, about time to revisit the purpose of it all and see if it could be done better I say :)
Posted by: sig | April 12, 2007 at 08:17
Duncan,
if the only use of the information was for producing a "Document" then obviously TeX, HTML, RTF and such would be best - they do exactly as you say "tag" the sections, paragraphs etc in a very efficient way.
Therein lies the diff - I'd like to use the information for more and cross-documents too - that's when those limited-to-one-file "tags" would not work any more.
If you look closer at a document/form you'll always find information-objects that are finite (address is a good example) - rip 'em out of the form, let them be finite objects and use them anywhere. That way you only need one of each, not hundreds (a source of errors, much work and reconciliation!).
But you're right - the "address object" would have to have "relationship" metadata relating it to other objects so you can in fact use it for anything and all.
There would be no limits to what method you'd choose, actually the more the merrier as "relationship between objects" equals knowledge!
With thingamy we have the following mechanisms that can be used for any object representing some real-world or virtual object: Linking (two-way direct link between any two objects, typically between father and son, me and house), Tags (very generic, a bit imprecise but increasingly precise with the volume of tags per object, typically for a person/employee - speaks Italian, C++, likes to travel), Objects as properties (a bike could have a specific frame and wheelset), Inheritance (build classes on other classes) and Places (widget is in warehouse B).
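To make the five mechanisms a bit more tangible, here is a small Python sketch. It is not thingamy code and does not claim to mirror how thingamy is built; the `Thing` class and every attribute name are assumptions chosen only to show one possible shape for links, tags, objects as properties, inheritance and places.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality keeps instances hashable
class Thing:
    name: str
    links: set = field(default_factory=set, repr=False)   # Linking (two-way)
    tags: set = field(default_factory=set)                # Tags
    parts: dict = field(default_factory=dict)             # Objects as properties
    place: Optional["Thing"] = None                       # Places

    def link(self, other):
        """Create a two-way direct link between two objects."""
        self.links.add(other)
        other.links.add(self)


class Person(Thing):      # Inheritance: build classes on other classes
    pass


class Employee(Person):
    pass


me = Employee("me")
son = Person("son")
me.link(son)
me.tags |= {"speaks Italian", "C++", "likes to travel"}

bike = Thing("bike")
bike.parts["frame"] = Thing("steel frame")
bike.parts["wheelset"] = Thing("wheelset")

widget = Thing("widget")
warehouse_b = Thing("warehouse B")
widget.place = warehouse_b
```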
Example - Address-object: Using the finite and persistent approach it would not be a part of you, it represents the house and could be linked to you. You again would be linked to an "insurance-object" and a "bank-account-object". When you move you unlink yourself from old house and link to new house - thus all insurance and bank account objects would know where to contact you in one simple stroke.
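The move could be sketched like this in Python. Again the `Obj` class and the helper names (`link`, `unlink`, `linked`, `contact_address`) are hypothetical, invented for the example rather than taken from thingamy; the sketch only shows that one unlink plus one link is enough for every linked object to find the new address.

```python
class Obj:
    """A generic object with a kind, raw properties and two-way links."""

    def __init__(self, kind, **props):
        self.kind = kind
        self.props = props
        self.links = set()


def link(a, b):
    a.links.add(b)
    b.links.add(a)


def unlink(a, b):
    a.links.discard(b)
    b.links.discard(a)


def linked(obj, kind):
    """All objects of the given kind directly linked to obj."""
    return [o for o in obj.links if o.kind == kind]


def contact_address(obj):
    """Follow the links: object -> person -> house."""
    person = linked(obj, "person")[0]
    house = linked(person, "address")[0]
    return house.props["street"]


old_house = Obj("address", street="1 Old Road")
new_house = Obj("address", street="2 New Lane")
you = Obj("person", name="You")
insurance = Obj("insurance")
bank_account = Obj("bank-account")
for other in (old_house, insurance, bank_account):
    link(you, other)

before = contact_address(insurance)

# Moving house: one unlink, one link, nothing else to update.
unlink(you, old_house)
link(you, new_house)

after = [contact_address(insurance), contact_address(bank_account)]
```

Note that the old address object is untouched; it still represents the old house and can be linked to whoever moves in next.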
Back to "Document" production - if I am the insurance company and want to print an insurance document for a client, I'd apply a template that picks up the insurance-object, the person-object linked to it and the address-object linked to that person, arranges the properties of said objects, slaps my insurance company logo on top, sets margins and Arial, and prints.
Another template could use all insurance-objects filtered by some tag, not use the person-objects but add the invoice-objects linked to those insurance-objects and voila a neat report ensues.
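A sketch of two such templates in Python, under the same caveat as before: the `Obj` class, the property names, the sample data and the plain-text "logo" are assumptions for illustration only, and real layout (margins, Arial) is left out.

```python
class Obj:
    def __init__(self, kind, tags=(), **props):
        self.kind = kind
        self.tags = set(tags)
        self.props = props
        self.links = []  # two-way links, kept in insertion order


def link(a, b):
    a.links.append(b)
    b.links.append(a)


def linked(obj, kind):
    return [o for o in obj.links if o.kind == kind]


def insurance_document(policy, logo="EXAMPLE INSURANCE CO."):
    """Template 1: insurance-object -> person-object -> address-object."""
    person = linked(policy, "person")[0]
    address = linked(person, "address")[0]
    return "\n".join([
        logo,
        person.props["name"],
        address.props["street"],
        "Policy " + policy.props["number"],
    ])


def invoice_report(policies, tag):
    """Template 2: insurance-objects filtered by tag, plus their invoices."""
    rows = []
    for policy in policies:
        if tag in policy.tags:
            for invoice in linked(policy, "invoice"):
                rows.append(f"{policy.props['number']}: {invoice.props['amount']}")
    return "\n".join(rows)


house = Obj("address", street="1 Example Street")
client = Obj("person", name="A. Client")
home_policy = Obj("insurance", tags=["home"], number="P-1")
car_policy = Obj("insurance", tags=["car"], number="P-2")
link(client, house)
link(home_policy, client)
link(car_policy, client)
link(home_policy, Obj("invoice", amount=120))
link(car_policy, Obj("invoice", amount=300))

document = insurance_document(home_policy)
report = invoice_report([home_policy, car_policy], "home")
```

Both outputs are produced from the same stored objects; neither the document nor the report is saved anywhere.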
Etc. :)
Posted by: sig | April 12, 2007 at 08:43
Sig, I totally agree with you—I hate replication, it normally leads to errors (and not just a few...)
I suppose my point was really that the two are not that different, but the levels that they work on are. Your approach (which I really like) is a macro approach, where the "tag document for formatting" approach is really like micro-managing the data.
There should only be one instance of any particular piece of data (like a company address, there is only one registered address, etc.)
Basically the data is then formed into a document by a style-sheet....but I guess that is pretty much what thingamy does :)
Posted by: Duncan Drennan | April 12, 2007 at 17:31
Duncan, precisely!
Three rules:
1) One data-object per real-world object.
2) Split into smallest possible objects.
3) Save only raw data.
and of course, the fourth (the SOX etc. bonus) rule:
4) Never lose anything. :D
Posted by: sig | April 12, 2007 at 17:56
The real trick (I imagine) with thingamy is to define the objects well... I'm guessing that is where it can get tricky, especially with intangible things like ideas, designs and so forth. The issue of intellectual capital is the one that I keep coming back to... (4) Never lose anything
Posted by: Duncan Drennan | April 13, 2007 at 08:27
Duncan, you're absolutely right on that one!
Every time in fact, a real brain-teaser but a good one - it pushes me to rethink my modeling on a very basic level. Often when I have decided on one I end up atomising it even further... sometimes reverting to a practical middle-of-the-road :)
Think I have a follow-up post brewing...
Posted by: sig | April 13, 2007 at 08:39
Just thinking aloud here with a simple example...
Consider that you have a purchasing application with vendors, materials and stuff.
Vendor's bank account number changes. So, instead of updating the bank account field on the vendor "form", you'll do this:
1. Create a new instance of the bank account object. It contains the new bank account number.
2. Remove the link between the vendor and the old bank account instance.
3. Link the vendor to the new bank account instance.
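In rough Python terms the three steps might look like the sketch below. The class and function names are hypothetical and not Thingamy's actual API; the `all_accounts` list is only there to show that the old instance is kept rather than overwritten.

```python
class BankAccount:
    def __init__(self, number):
        self.number = number


class Vendor:
    def __init__(self, name):
        self.name = name
        self.bank_account = None  # link to a BankAccount instance


all_accounts = []  # every instance ever created stays here


def create_account(number):
    account = BankAccount(number)
    all_accounts.append(account)
    return account


def change_bank_account(vendor, new_number):
    new_account = create_account(new_number)  # 1. create the new instance
    vendor.bank_account = None                # 2. remove the old link
    vendor.bank_account = new_account         # 3. link to the new instance
    return new_account


vendor = Vendor("Example Supplies")
vendor.bank_account = create_account("OLD-0001")
change_bank_account(vendor, "NEW-0002")
```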
Is this how it goes in Thingamy?
Posted by: Tomi Itkonen | April 15, 2007 at 09:43
Hi Tomi,
(sorry for the belated response, seems the Typepad mail had hiccups)
Basically yes - and you are touching upon the issue of "where's the border" between my world and the rest.
In the perfect data system all would of course be represented as it is in reality. And there the instance of a bank account would be created only once when it was created at the bank.
Rest would be as you say - unlink, link to another bank account.
When it's hard to represent "everything" you have to make practical choices - i.e. have the bank account number as a property field in a customer-object and change that. Old-hat style :)
Thingamy has no rules! It's just built so you'll get more and more out of it the closer you get to modeling reality... still, it can behave like any old application (quirky yes, not as smooth definitely, but still)
:D
Posted by: sig | April 16, 2007 at 20:56