I'm finally back in the saddle working on extensibility and versioning. Looking back, I think that we've done a decent job over the past few years of raising awareness of the "must ignore unknown" and "must understand" models. In parallel, we're also getting a better understanding of how to describe interfaces to applications.
It's time to take the next step in looking at understanding in applications. A good case study is Atom. The Atom working group decided against adding a "mustUnderstand" marker to the Atom language. I think the main reason is that there are various types of Atom applications, and it was too hard to figure out how to "target" the must understand at the right type of application. For example, if an entry has an extension marked mU, does a feed aggregator have to fail if it doesn't understand it? A feed aggregator partially understands entries as it looks at some of the content (particularly the author child), but it doesn't understand all of it. The Atom group also wisely decided it didn't want to formally define processor classes, i.e. "aggregator", "entry handler", etc. FWIW, SOAP provides some hooks for this through the "role" attribute, which targets header blocks at particular nodes. But very few applications seem to use SOAP header blocks for application data extensions, let alone the role attribute for targeting particular nodes.
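To make the SOAP hook concrete, here is a minimal sketch of role-targeted "must understand" checking. The envelope namespace, the `role` and `mustUnderstand` attributes, and the default ultimateReceiver role come from SOAP 1.2; the `ext:audit` and `ext:hint` headers, the `urn:example` names, and the `not_understood` helper are all made up for illustration, and this is not a full SOAP processing model.

```python
import xml.etree.ElementTree as ET

SOAP = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
ULTIMATE = SOAP + "/role/ultimateReceiver"        # role assumed when none is given

# Hypothetical message: the ext:* headers and urn:example names are invented.
MESSAGE = f"""
<env:Envelope xmlns:env="{SOAP}" xmlns:ext="urn:example:ext">
  <env:Header>
    <ext:audit env:role="urn:example:role:logger" env:mustUnderstand="true"/>
    <ext:hint env:role="urn:example:role:logger"/>
  </env:Header>
  <env:Body/>
</env:Envelope>
"""

def not_understood(message, my_roles, understood):
    """List header QNames that are targeted at one of my_roles, marked
    mustUnderstand, and absent from the set of QNames this node understands."""
    root = ET.fromstring(message)
    header = root.find(f"{{{SOAP}}}Header")
    missing = []
    for block in (header if header is not None else []):
        role = block.get(f"{{{SOAP}}}role") or ULTIMATE
        must = block.get(f"{{{SOAP}}}mustUnderstand") in ("true", "1")
        if role in my_roles and must and block.tag not in understood:
            missing.append(block.tag)
    return missing

# A logger node that knows nothing about ext:audit would have to fault;
# a node acting in some other role can carry on.
print(not_understood(MESSAGE, {"urn:example:role:logger"}, set()))
print(not_understood(MESSAGE, {"urn:example:role:other"}, set()))
```

Note that the role token only says who is addressed. What "logger" means is left to the applications, which is exactly the part Atom chose not to define.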
Many applications deal with partial processing as the document is transferred from one piece of software to another. I regularly use the "Name" example, so imagine that the Name exists in a Medical Record. There may be many pieces of software that operate on the medical record, and even many different pieces that work on the same subset. There might be the "patient info validation" component that uses the Name, and then there is the "patient info display" that uses the Name, and then "patient query" that also uses the Name. Each of these "Name" processors could potentially be targeted for extensions, and targeting them is just as difficult as targeting different Atom Entry processors.
What we have is a couple of cases showing the difficulty of partial understanding of XML. This difficulty is by no means limited to Atom. It pervades XML applications, and I think it is one of our next big problems to address for achieving distributed extensibility and versioning. Which brings us to the usual questions: what's the problem, is it worth solving, how can it be solved, what's the best way for it to be solved. My intuition is that Schema 1.1 and the PSVI could be part of the solution because I think we'll need reporting on partial validation in order to get partial understanding.
Problem
A strawman problem statement is "How can a language be designed and evolved in the context of a variety of processor types?" This problem statement leads us squarely to the hard parts of the problem: how are processor types identified, and how is the language subset identified?
Identifying Processor Types
XML processors come in many sizes, shapes and colo(u)rs, ranging from editors to parsers to full-blown b2b applications. Trying to figure out at language design time what all the different processors for a language are seems very hard, probably unnecessary, and probably harmful. Imagine that an application decides that there are n different processor types. What happens when innovation of type n+1 comes along? In the case of Atom, what if the original RSS community had decided that there were only "entry processors" and "text viewing processors"? They might have accidentally precluded feed aggregators.
I have a feeling that there are 2 extremes for identifying classes of processors. At one extreme, the classes of processors are identified in the language. Interestingly, XML itself does this by specifying "non-validating" vs "validating" processors. A little more to the middle is where the class of processor can be identified using a token (like soap:role) but the meaning of the token is undefined in the language. At the other extreme, the class of processor isn't identifiable as an entity at all but rather by the QNames it understands.
The targeting could be done based upon the language itself. Something like "If you understand entry/content QNames, then you must understand entry/DaveOsExtension QNames". This might be usable for the Atom folks to allow text editors and feed aggregators to ignore the DaveOsExtension.
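A rule of that "if you understand X, you must understand Y" shape can be sketched in a few lines. Everything here is hypothetical: the `RULES` table, the `must_fail` helper, and the use of plain path strings in place of real QNames are assumptions made to keep the example small.

```python
# Hypothetical rule table: "if you understand the key, you must understand the values".
RULES = {"entry/content": {"entry/DaveOsExtension"}}

def must_fail(understood, present, rules=RULES):
    """Return the extensions that force this processor to fail.

    understood: names the processor understands
    present:    names that actually occur in the document
    """
    required = set()
    for trigger, extensions in rules.items():
        if trigger in understood:
            required |= extensions
    # Only extensions that are present and not understood cause a failure.
    return sorted((required & present) - understood)

document = {"entry/author", "entry/content", "entry/DaveOsExtension"}

aggregator = {"entry/author", "entry/title"}   # never looks at content
renderer = {"entry/author", "entry/content"}   # looks at content

print(must_fail(aggregator, document))  # aggregator may ignore the extension
print(must_fail(renderer, document))    # renderer is targeted by the rule
```

The processor type never appears by name. It is implied entirely by the set of names the processor claims to understand.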
Language subsets
The scenario of expressing understanding based upon QNames is a simple solution to how to express the language subset that the extension is related to. In the sample scenario, the entry/content QName suffices for a general class of processors. But what if that QName isn't sufficient, say it's entry/content where content has attribute foo or bar but not baz AND something else?
The two problems - identifying the processor type and identifying the language subset for a processor type - seem intricately coupled. The processor type is probably defined by the language subset it operates on, and the language subset is determined by what the processor is doing.
A further complication is that an extension could be mandatory for multiple processor types, each with a separate subset of the language that they operate on. There's a lot of potential yuckiness in either repeating the extension for each processor type/language subset or coming up with a framework for targeting multiple types/subsets for an extension.
Validation and Understanding
Admittedly this is a potentially hard problem, but what building blocks do we have? Let's assume that we are using a schema validator at run-time - I know this is a *big* assumption, but bear with me. Many people have lambasted the PSVI but it provides some *very* interesting pieces of information. In particular, it records the validation attempted on a given XML component, and the results of that validation. So you can find out whether a component was validated successfully or not.
I first encountered this when I proposed that WSDL 2.0 could use this feature of Schema to enable relaxing of the non-determinism constraint. The idea is that WSDL 2.0 could use Schema in a way that removed any extra XML components that failed schema validation because they had violated the non-determinism content rules, particularly to allow WSDL 2.0 to incorporate the "Must Ignore Unknowns" rule. (W3C Member only link)
If we did want to make any kind of expression of understanding, we could start with validation. We could specify that validating Entry/Content elements is the same as understanding Entry/Content elements.
There's obviously a lot of area to explore as to how to use the PSVI. How does one refer to the PSVI validation attempted/results information items? This could be as simple as using XPath referring to the additions (Entry/Content[@validationAttempted='true' and @validationSucceeded='true']), or as complicated as a new set of Schema specific XPath functions. I'd think that the PSVI should take into account this usage pattern.
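As a rough sketch of what "validation as understanding" might look like, the following records a PSVI-like pair for each child of an entry and then asks whether a given element was understood. The property values ("full"/"none", "valid"/"invalid"/"notKnown") mirror the PSVI's [validation attempted] and [validity] properties, but the `DECLARATIONS` table is only a stand-in for a schema, not XML Schema, and `assess` and `understood` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

DOC = """
<Entry>
  <Author>Dave</Author>
  <Content type="text">Hello</Content>
  <DaveOsExtension>???</DaveOsExtension>
</Entry>
"""

# Stand-in "schema": element name -> check function. This is NOT XML Schema.
DECLARATIONS = {
    "Author": lambda e: bool((e.text or "").strip()),
    "Content": lambda e: e.get("type") in ("text", "html", "xhtml"),
}

def assess(root, declarations):
    """Map each child element name to (validation_attempted, validity)."""
    report = {}
    for child in root:
        check = declarations.get(child.tag)
        if check is None:
            # No declaration: nothing was attempted, so validity is unknown.
            report[child.tag] = ("none", "notKnown")
        else:
            report[child.tag] = ("full", "valid" if check(child) else "invalid")
    return report

def understood(report, name):
    """Treat 'fully validated and valid' as 'understood'."""
    return report.get(name) == ("full", "valid")

report = assess(ET.fromstring(DOC), DECLARATIONS)
print(report)
print(understood(report, "Content"), understood(report, "DaveOsExtension"))
```

The interesting case is the extension: it is neither valid nor invalid, just unassessed, and that third state is what a must-understand rule would need to key on.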
Partial Understanding = Partial Schema?
To associate understanding with validation, it probably also means that implementations need Schema subsets. The Atom specification provides a complete Schema for the Entry. But a feed aggregator only wants to parse a subset of the Entry. It needs a partial schema for doing the validation.
Right now, we generally build up the Schema of a message to be the composite of all the processors that work on the message. This monolithic view of an interface is both helpful - as it's arguably simpler than expressing multiple schemas/message - and harmful - because it doesn't handle the nuanced kinds of processing that we are talking about.
If we could provide multiple schemas (gee, kind of like XHTML does) for constructs, then we could also provide a processing pipeline where each component gradually adds in the validation information as it is needed.
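Here is one way such a pipeline could look, as a sketch only. Each stage carries its own partial "schema" (again just a table of check functions standing in for real schemas), and the validation information accumulates as the entry moves along. The stage names, the `Ext` element, and `run_pipeline` are all invented for the example.

```python
import xml.etree.ElementTree as ET

ENTRY = ET.fromstring(
    "<Entry><Author>Dave</Author><Content type='text'>Hi</Content><Ext/></Entry>"
)

# Hypothetical partial schemas, one per processor in the pipeline.
AGGREGATOR = {"Author": lambda e: bool(e.text)}
RENDERER = {"Content": lambda e: e.get("type") == "text"}

def run_pipeline(entry, stages):
    """Return a snapshot of the accumulated validation results after each stage."""
    validated = {}
    snapshots = []
    for schema in stages:
        for child in entry:
            # Each stage validates only what its own partial schema covers,
            # and does not redo work an earlier stage already did.
            if child.tag in schema and child.tag not in validated:
                validated[child.tag] = schema[child.tag](child)
        snapshots.append(dict(validated))
    return snapshots

for step in run_pipeline(ENTRY, [AGGREGATOR, RENDERER]):
    print(step)
```

No stage ever validates `Ext`, so no stage claims to understand it. Whether that should be an error depends on which stage, if any, the extension was targeted at.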
The breakup and consolidation continues
I've argued for a long time now that XML Namespaces will result in languages that are smaller and smaller because it's easier to version smaller languages than larger languages. The farthest degree of this is where each element or attribute has its own namespace. We're clearly not there, but some of the WS-* specs are getting pretty close. As the namespaces get smaller, it's obvious that the schemas will get smaller.
And as the namespaces get "stitched" together into composite languages, we need Schemas that can be stitched together in the same way.
Conclusion
I started talking about the difficulties of "must understand" in applications that have partial understanding. This led to thinking about how to identify a subset of a language and how to identify a processor for that subset. We thought about tying the validation logic to the understanding logic, which further led to the need to express the schema for the processor rather than the composite.
I think that expressing partial understanding will cause a need for multiple schemas per name, stronger PSVI support for validation using a particular schema, and targeting language extensions based upon the partial validation results.
I'm looking forward to your follow-ups on this topic -- I believe that the ability to implement applications that only have partial understanding of the data sent to them is *the* key benefit of using XML in the first place. Have you considered the use of Schematron for validation in these circumstances?
Stefan, I do think that Schematron is the canonical example of a "2nd" language for many applications that are primarily XML Schema based. I really like the use of Schematron annotations in the XML Schema, but I think that James Clark's NRL is probably a better general solution.
Thanks, I've never heard of NRL before and will check it out.