Validation by Projection
Many of the architectures and strategies for validation apply validity checking to a particular document, with a pass or fail result for that document. This assumes that the schemas used in validation are expressive enough to describe all potential versions of documents, including any extensions. We have regularly seen that the XML Schema 1.0 wildcard limits the ability to fully describe documents. For example, it is impossible to have a content model that has optional elements in multiple namespaces with a wildcard at the end. The choice is to have either the wildcard or the elements.
There is another approach to validation, called validation by projection, which effectively removes any unknown content prior to validation. It validates a projection of the XML document, where the projection is a subset of the XML document with no other modifications to the contents, including their order.
Using our regular Name example, which defines family and given as children of personName, consider the following document:
<personName>
<family>Orchard</family>
<middle>Bryce</middle>
<given>David</given>
</personName>
We see that there is a middle element that is not known. Because our personName example does not allow a middle element, validation of the document would fail.
A projection of this document is:
<personName>
<family>Orchard</family>
<given>David</given>
</personName>
We immediately see the benefit of validation by projection: extra content is ignored for validation. Many more documents that are intended to be valid will be valid under validation by projection. Validation by projection is an implementation of the Must Ignore Unknown or Must Accept Unknown rules.
There are some complexities to the projection. The projection could remove unknown attributes, but it might need to preserve any xml: namespaced attributes such as xml:base and xml:lang. There are also constraints that the projection perhaps should preserve, such as projecting only family elements that are children of personName and ignoring family elements that are children of any other element.
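As a rough illustration of the attribute case, the following Python sketch keeps known attributes plus anything in the xml: namespace. The function name project_attributes, the known set, and the id and nickname attributes are hypothetical and chosen only for this example. It relies on the standard library's ElementTree, which expands xml:lang to a name qualified with the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def project_attributes(element, known):
    """Return the attributes to keep: those named in `known` (a hypothetical
    set of attribute names the schema allows) plus any xml: attribute."""
    return {
        name: value
        for name, value in element.attrib.items()
        if name in known or name.startswith("{" + XML_NS + "}")
    }

# Hypothetical input: id is known, nickname is not, xml:lang must survive.
elem = ET.fromstring('<personName xml:lang="en" id="p1" nickname="dave"/>')
kept = project_attributes(elem, known={"id"})
print(kept)
```

Here the unknown nickname attribute is dropped, while id and xml:lang are retained.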
Projection Algorithm
Part of validation by projection is determining what to project. Our name example was fairly straightforward because there is a complexType definition with only two child elements. The simplest rule for determining what to project is:
Starting at the root element, project any attributes and any elements that match elements in the content model of the current complexType and recurse into each element.
This very simple rule ignores complexities, such as excluding attributes, elements that match wildcards, and global element definitions.
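The simple rule can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: CONTENT_MODEL is a hypothetical hand-written table standing in for the complexType definitions that a real implementation would read from the schema, and the function name project is likewise invented. Wildcards, attribute exclusion, and global element definitions are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> child element names it allows.
# A real implementation would derive this from the schema.
CONTENT_MODEL = {
    "personName": ["family", "given"],
    "family": [],
    "given": [],
}

def project(element, model=CONTENT_MODEL):
    """Return a copy of `element` that holds only the children the model knows."""
    # The simple rule projects every attribute of the current element.
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        # Children are matched against the current element's content model,
        # so a family element is kept only where its parent allows it.
        if child.tag in model.get(element.tag, []):
            copy.append(project(child, model))  # recurse into known children
    return copy

doc = ET.fromstring(
    "<personName><family>Orchard</family><middle>Bryce</middle>"
    "<given>David</given></personName>"
)
projected = project(doc)
print(ET.tostring(projected, encoding="unicode"))
```

The projected tree contains family and given in their original order, and it is this tree, not the original document, that would be handed to the validator.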
The crucial aspect of validation by projection is that the Accept Set under validation by projection is a superset of the Accept Set without validation by projection. In general, the larger we can make the Accept Set, the more room there is for versioning. Because the validation by projection Accept Set is potentially a superset of the Accept Set that can be specified by XML Schema, validation by projection will generally allow more language changes to be compatible changes compared to XML Schema.
With validation by projection it is even possible to go beyond elements and attributes and into the content. In most cases this won't be useful, but for enumerations (code lists, etc.) it is possible to retain an element if it contains a known value and remove it if it contains an unknown value.
There is a common programming style that performs validation by projection automatically: parse the XML and extract all necessary elements with XPath expressions. Unknown content is then ignored automatically.
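A content-level projection for enumerations might look like the following Python sketch. The title element, the KNOWN_VALUES code list, and the function name project_enumerations are all hypothetical and are not part of the Name example. They serve only to show an element being kept or removed according to its value.

```python
import xml.etree.ElementTree as ET

# Hypothetical code list: element name -> set of values the schema enumerates.
KNOWN_VALUES = {"title": {"Mr", "Ms", "Dr"}}

def project_enumerations(parent, known_values=KNOWN_VALUES):
    """Remove, in place, child elements whose enumerated value is unknown."""
    for child in list(parent):  # iterate over a copy because we remove children
        allowed = known_values.get(child.tag)
        if allowed is not None and (child.text or "").strip() not in allowed:
            parent.remove(child)
    return parent

doc = ET.fromstring(
    "<personName><title>Dr</title><title>Maestro</title>"
    "<given>David</given></personName>"
)
result = project_enumerations(doc)
print(ET.tostring(result, encoding="unicode"))
```

Elements without an entry in the code list, such as given, are left untouched by this step.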
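A small Python sketch of this style, using the Name example, is shown below. It uses the standard library's ElementTree, which supports only a limited subset of XPath. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<personName><family>Orchard</family><middle>Bryce</middle>"
    "<given>David</given></personName>"
)

# Extract only the elements the application knows about.
# The unknown middle element is never asked for, so it is simply ignored.
family = doc.findtext("./family")
given = doc.findtext("./given")
print(family, given)
```

The application never notices the middle element, which has the same effect as projecting the document before use.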