Providing Compatible Schema Evolution

David Orchard

Jan 19th 2004

Introduction

This paper examines current solutions and a number of possible changes to current Web technology to enable a simple use case: evolving an XML language in a loosely coupled manner while retaining validation of new and old instances using new and old schemas. In the Versioning XML Languages article, I describe a set of rules that enable schemas to evolve in forwards and backwards compatible ways without requiring changes on both senders and receivers, yet retaining the ability to validate against newer schemas if they are available. But even this technique suffers from a number of shortcomings. This paper elaborates on the problems with the advocated approach and describes a large number of potential solutions, ranging from no changes in XML Schema to fairly radical changes in schema.

The article starts with a simple use case, then surveys existing solutions and the problems associated with them. It then examines a number of possibilities, each covered in its own section below.

We start with a simple use case of a name with a first and last name, and its schema. We will then evolve the language and instances to add a middle name. The base schema is:

<xs:complexType name="nameType">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

Which validates the following document:

<name> <first>Dave</first> <last>Orchard</last> </name>

The scenario asks how to validate documents such as the following, whether or not the new schema with the extension is available to the receiver:

<name> <first>Dave</first> <last>Orchard</last> <middle>B</middle> </name>

<name> <first>Dave</first> <middle>B</middle> <last>Orchard</last> </name>

Current solutions

Wildcards and type extension can both be used to enable validation of the instances and creation of a new schema. Roughly speaking, the solutions today for adding an element are:

Type extension

Use type extension or substitution groups for extensibility. A sample schema is:

<xs:complexType name="NameExtendedType">
  <xs:complexContent>
    <xs:extension base="tns:nameType">
      <xs:sequence>
        <xs:element name="middle" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

This requires that both sides simultaneously update their schemas and breaks backwards compatibility. It only allows the extension after the last element.

Change the namespace name or element name

The author simply updates the schema with the new type. A sample is:

<xs:complexType name="nameType">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:element name="middle" type="xs:string" minOccurs="0"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

This does not allow extension without changing the schema, and thus requires that both sides simultaneously update their schemas. If a receiver has only the old schema and receives an instance with middle, the instance will not be valid under the old schema.

Use wildcard with ##other

This is a very common technique. A sample is:

<xs:complexType name="nameType">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

The problems with this approach are summarized in Examining elements and wildcards as siblings. A summary of the problem is that the namespace author cannot extend their schema with extensions and correctly validate them because a wildcard cannot be constrained to exclude some extensions.

Use wildcard with ##any or ##targetNamespace

This is not possible when optional elements are involved, because of XML Schema's Unique Particle Attribution rule; the rationale is described in the Versioning XML Languages article. An invalid schema sample is:

<xs:complexType name="nameType">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

The Unique Particle Attribution rule does not allow a wildcard adjacent to optional elements or before elements in the same namespace.

Extension elements

This is the solution proposed in the versioning article. A sample of the pre-extended schema is:

<xs:complexType name="nameType">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:element name="extension" type="tns:ExtensionType" minOccurs="0" maxOccurs="1"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>
<xs:complexType name="ExtensionType">
  <xs:sequence>
    <xs:any processContents="lax" minOccurs="1" maxOccurs="unbounded" namespace="##targetNamespace"/>
  </xs:sequence>
</xs:complexType>

An extended instance is:

<name> <first>Dave</first> <extension> <middle>B</middle> </extension> <last>Orchard</last> </name>

This is the only solution that allows backwards and forwards compatibility, and correct validation using the original or the extended schema. This article shows a number of the difficulties remaining, particularly the cumbersome syntax and the potential for some documents to be inappropriately valid. This solution also has the problem that each subsequent version increases the nesting by one level. Personally, I think that the difficulties, including potentially deep nesting levels, are not major compared to the ability to do backwards and forwards compatible evolution with validation.

However, this set of solutions does not appear to be fully satisfactory. Let us examine a wide variety of potential solutions.

New validation model: Projection

As described in the article, compatibility is directly related to the Must Ignore rule. In some cases, we wish to allow any extensions and only constrain those we know about. From the article, "The problem with this last approach is that with a specific schema it is sometimes necessary to apply the same schema in a strict or relaxed fashion in different parts of a system. A long-standing rule for the Internet is the Robustness Principle, articulated in the Internet Protocol [3], as "In general, an implementation must be conservative in its sending behavior, and liberal in its receiving behavior". In schema validation terms, a sender can apply a schema in a strict way while a receiver can apply a schema in a relaxed way. In this case, the degree of strictness is not an attribute of the schema, but of how it is used. A solution that appears to solve these problems is to define a form of schema validation that permits an open content model that is used when schemas are versioned. We call this model validation "by projection", and it works by ignoring, rather than rejecting, component names that appear in a message that are not explicitly defined by the schema. We plan to explore this relaxed validation model in the future."

Given the previous example of a schema, the receiver could apply a more relaxed schema that applies the mustIgnore rule. What we need is the restrictive non-extensible schema, and then a validation mode that allows elements that are not known to exist. This validation mode applies the schema to any elements by matching names. If a name does not exist in the schema, it is ignored. If the name does exist in the schema, it is validated against the schema. From a validation perspective, unknown elements are "projected" out of the instance for the purposes of validation. This is an application of the "Must Ignore" rule - rule #5 in the Versioning XML vocabularies - to the schema validator. David Bau provides excellent material on this in his Theory of Compatibility part 3. It is effectively an implicit wildcard with ##any before and after each element.

There is a need to configure which kinds of elements are "projected" out for the purposes of validation. A solution is a flag that specifies what kinds of elements are retained, such as "Validate only known elements", "Validate all elements".

This permits extensions in any namespace, anywhere in the instance. It does not require the schema author place wildcards throughout their schema. It allows the schema author to change the schema definition, while retaining the backwards and forwards compatibility. Thus it seems very well suited for the goals of extensibility and versioning.

This could be deployed in today's environment. It requires that all receivers and any senders of extended documents use a validator that performs validation by projection. Continuing our example, imagine that names are being exchanged. The schema could be written in a non-extensible manner. The sender could use a regular validator and things would behave as expected. A sender that introduces new constructs would use the newer validator, or a private updated schema. The receiver would use the newer validator. This kind of deployment is sound in many environments, particularly where there are many different senders and fewer receivers.
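The projection step can be illustrated with a small Python sketch. It is not a real schema validator: the content model is a hypothetical, hand-written stand-in for nameType (a list of element name, minOccurs, maxOccurs), namespaces are ignored, and the mode values "all" and "known" are invented names for the "Validate all elements" and "Validate only known elements" flags mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for nameType: (element name, minOccurs, maxOccurs),
# listed in sequence order. A real validator would read this from the schema.
NAME_TYPE = [("first", 1, 1), ("last", 0, 1)]


def validate_sequence(names, content_model):
    """Strictly check an ordered list of child names against a sequence."""
    pos = 0
    for name, min_occurs, max_occurs in content_model:
        count = 0
        while pos < len(names) and names[pos] == name and count < max_occurs:
            pos += 1
            count += 1
        if count < min_occurs:
            return False
    # Any child left over was not accounted for by the content model.
    return pos == len(names)


def validate(xml_text, content_model, mode="all"):
    """mode="all" validates every child; mode="known" projects out unknown names first."""
    children = [child.tag for child in ET.fromstring(xml_text)]
    if mode == "known":
        known = {name for name, _, _ in content_model}
        children = [tag for tag in children if tag in known]
    return validate_sequence(children, content_model)


extended = "<name><first>Dave</first><middle>B</middle><last>Orchard</last></name>"
print(validate(extended, NAME_TYPE, mode="all"))    # strict: middle is rejected
print(validate(extended, NAME_TYPE, mode="known"))  # projection: middle is ignored
```

Under these assumptions the same non-extensible content model rejects the extended instance in strict mode and accepts it in projection mode, while known elements are still checked for order and occurrence.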

Particle Attribution

One significant problem with XML Schema and evolution using wildcards is the Unique Particle Attribution rule. This rule and its limitations are described in the versioning article, so they won't be reprised in full. Essentially, a wildcard cannot be after an optional element and cannot be before an element in the same namespace.

The article suggests "A less restrictive type of deterministic model could be employed, such as the "greedy" algorithm defined in the URI specification [4]. This would allow optional elements before wildcards and removing the need for the Extension type we introduced. This still does not allow wildcards before elements, as the wildcard would match the elements instead. Further, this still does not allow wildcards and type extension of the type to coexist. A "priority" wildcard model, where an element that could be matched by a wildcard or an element would match with an element if possible would allow wildcards before and after element declarations.".

The problem identified above is how to allow wildcards after optional elements. It would allow greater flexibility if one could write a schema such as

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:element name="last" type="xs:string" minOccurs="0"/>
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

One suggestion is that XML Schema wildcards could use a "greedy" algorithm whereby the first possible occurrence of a definition is matched.

In this example, because the "last" element definition occurs before the wildcard, an occurrence of a "last" element in a document should be attributed to the "last" element definition, and any subsequent non-"last" elements are attributed to the wildcard. This allows us to now express the previous schema as a valid XML Schema.

However, it does not allow us to write the following schema:

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

The problem is that a "last" element in a document would "greedily" match the wildcard, because the wildcard comes before the "last" element definition. Further, we still can't write a schema such as

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <xs:element name="last" type="xs:string" minOccurs="0"/>
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
<xs:complexType name="ExtendedName">
  <xs:complexContent>
    <xs:extension base="tns:name">
      <xs:sequence>
        <xs:element name="middle" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

because of the same problem. In this case, an occurrence of "middle" in the document would be greedily matched to the wildcard in name. Readers are reminded that this is an illegal schema in today's XML Schema.

The second suggestion is that XML Schema wildcards could be lower priority than element definitions. If an element definition and a wildcard are siblings, then the element definition would match if it could. This can be considered a modification of the "greedy" algorithm, where element definitions are "greedy" compared to wildcards. Using this change to XML Schema, we could write schemas that allow wildcards before and after element definitions.
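The difference between the two suggestions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a content model is a flat sequence of particles, a wildcard matches any name regardless of namespace, and the names Particle and attribute are invented here rather than taken from XML Schema or any validator.

```python
from dataclasses import dataclass
from typing import List, Optional

UNBOUNDED = float("inf")


@dataclass
class Particle:
    name: Optional[str]        # None marks a wildcard (xs:any)
    min_occurs: int = 1
    max_occurs: float = 1

    def accepts(self, child: str) -> bool:
        return self.name is None or self.name == child


def attribute(children: List[str], particles: List[Particle],
              prefer_elements: bool) -> Optional[List[str]]:
    """Attribute each child to a particle; return None if the instance is invalid.

    prefer_elements=False is the "greedy" rule: the first reachable particle wins.
    prefer_elements=True is the "priority" rule: a reachable element
    declaration wins over a reachable wildcard.
    """
    result = []
    pos, count = 0, 0              # current particle and how often it has matched
    for child in children:
        candidates = []
        j, c = pos, count
        while j < len(particles):
            p = particles[j]
            if c < p.max_occurs and p.accepts(child):
                candidates.append(j)
            if c < p.min_occurs:
                break              # cannot skip a particle whose minimum is unmet
            j, c = j + 1, 0
        if not candidates:
            return None
        chosen = candidates[0]
        if prefer_elements:
            chosen = next((k for k in candidates if particles[k].name is not None), chosen)
        count = count + 1 if chosen == pos else 1
        pos = chosen
        result.append(particles[chosen].name or "*")
    # Every particle that was never reached must be optional.
    j, c = pos, count
    while j < len(particles):
        if c < particles[j].min_occurs:
            return None
        j, c = j + 1, 0
    return result


# first, then a wildcard, then an optional last (the schema greedy matching cannot handle).
model = [Particle("first"), Particle(None, 0, UNBOUNDED), Particle("last", 0, 1)]
doc = ["first", "middle", "last"]
print(attribute(doc, model, prefer_elements=False))  # greedy
print(attribute(doc, model, prefer_elements=True))   # priority
```

With the greedy rule the "last" child is swallowed by the wildcard and never reaches its own declaration; with the priority rule it is attributed to the "last" declaration, which is the behaviour the second suggestion is after.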

Allowing undefined elements only

One of the problems with using the wildcard for extensibility is that it defines the allowable names by namespace name. This may be too coarse an axis to choose. In many cases, when we define elements, we want to ensure that the element rules are matched whenever the elements occur. For example, we want the following document to be invalid:

<name> <first>Dave</first> <last>Orchard</last> <first>Dave</first> </name>

The article said "Additionally, a wildcard that only allowed elements that had not been defined -- effectively other namespaces plus anything not defined in the target namespace -- is another useful model. These changes would also allow cleaner mixing of inheritance and wildcards. ".

The issue raised is that the wildcard does not have a model that says only undefined elements are allowed. One solution would be to have a new attribute on wildcards that specified a subset of the elements allowed. Copying the namespace attribute into an "element" attribute, the following schema could be used to define an extensibility point that allowed only unknown elements in the target namespace:

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <!-- Note the currently illegal element attribute -->
    <xs:any namespace="##targetNamespace" element="##other" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>
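As a rough illustration of what such a wildcard would accept, the Python sketch below hard-codes the schema above: a required first, then any number of elements whose names are not declared in the type, then an optional last. The element="##other" attribute is the hypothetical one proposed here, namespaces are ignored, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

# Names declared in the (hypothetical) type; the proposed element="##other"
# wildcard may only match names outside this set.
DECLARED = {"first", "last"}


def valid_name(xml_text):
    names = [child.tag for child in ET.fromstring(xml_text)]
    if not names or names[0] != "first":
        return False               # first is required and must come first
    rest = names[1:]
    if rest and rest[-1] == "last":
        rest = rest[:-1]           # optional last closes the sequence
    # Everything in between must be matched by the wildcard,
    # which admits undeclared names only.
    return all(name not in DECLARED for name in rest)


print(valid_name("<name><first>Dave</first><middle>B</middle><last>Orchard</last></name>"))
print(valid_name("<name><first>Dave</first><first>Dave</first><last>Orchard</last></name>"))
```

The first instance is accepted because middle is unknown; the second is rejected because the repeated first is a declared name and so cannot be attributed to the wildcard.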

Namespace Name variability

Even assuming that one could add optional defined elements or specify excluding unknown elements, there is a mismatch between expectations of namespace names for language owners and for extension owners. A language designer currently has no way of separating out the extensions that they may want to provide versus those that others may provide. The granularity of the namespace attribute in wildcards is: ##other, ##targetnamespace, ##any, or a particular namespace name. Let us make a few observations before making a suggestion. A language designer for foo.com will "know" that any subsequent version that they create will be in foo.com's namespace. And they probably have an algorithm for specifying the URI for the namespace name. An example might be http://www.foo.com/ns/2004/01/PO. They guarantee by their URI assignment that any subsequent versions will increase in date and will contain a PO.

If they could specify that a namespace name of "http://www.foo.com/*" was allowed in a wildcard, they could reserve a space of names for their own use. This allows them to avoid UPA constraints on ##any, while restricting ##other to a subset that they control. A corollary would be that they could then provide a wildcard that excluded their domain name for the express purpose of allowing others to extend. This might look something like an expression with a "-" sign to indicate subtraction from the allowable set, as in namespace="##other -http://www.foo.com/*"

For example, if a schema author wanted to allow only themselves to extend a schema and exclude others, the above suggestion would allow a schema that looks like:

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string" />
    <!-- Note the namespace attribute wildcard -->
    <xs:any namespace="http://www.foo.com/*" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

From the namespace owner's perspective, they have the option of using new namespaces for extensions and excluding third parties from inserting into the reserved space.

It is consistent with Web architecture for a URI authority to publish rules for how to construct URIs.

There may be other mechanisms that are better suited for doing a form of URI matching.
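One possible reading of the proposed syntax is sketched below in Python. Everything about it is an assumption made for illustration: a trailing "*" is treated as a simple prefix match, a leading "-" subtracts from the allowed set, and the function names are invented. The ##other case follows the XML Schema 1.0 meaning of a namespace that is present and different from the target namespace.

```python
def _matches(pattern, namespace):
    """Prefix match for patterns ending in '*', exact match otherwise."""
    if pattern.endswith("*"):
        return namespace.startswith(pattern[:-1])
    return namespace == pattern


def wildcard_allows(namespace_attr, element_ns, target_ns):
    """Decide whether an element's namespace is admitted by the wildcard."""
    tokens = namespace_attr.split()
    allowed = False
    for token in tokens:
        if token.startswith("-"):
            continue                                   # exclusions handled below
        if token == "##any":
            allowed = True
        elif token == "##other":
            allowed = allowed or element_ns not in ("", target_ns)
        elif token == "##targetNamespace":
            allowed = allowed or element_ns == target_ns
        else:
            allowed = allowed or _matches(token, element_ns)
    excluded = any(_matches(token[1:], element_ns)
                   for token in tokens if token.startswith("-"))
    return allowed and not excluded


target = "http://www.foo.com/ns/2004/01/PO"
later_version = "http://www.foo.com/ns/2005/06/PO"   # hypothetical later version
third_party = "http://www.example.org/ext"           # hypothetical extension author

print(wildcard_allows("http://www.foo.com/*", later_version, target))
print(wildcard_allows("##other -http://www.foo.com/*", third_party, target))
print(wildcard_allows("##other -http://www.foo.com/*", later_version, target))
```

The first wildcard reserves the foo.com space for the language owner, while the second admits third-party namespaces and keeps them out of that reserved space.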

Cumbersome wildcard syntax

A significant problem with wildcard constructs is that they require the author either to place wildcards in every possible extensible spot or to guess where the extensibility will occur.

Imagine that we really want to allow the following document to be valid:

<name> <title>Mr</title> <first>Dave</first> <middle>b</middle> <last>Orchard</last> <suffix>I</suffix> </name>

We would need a schema something like:

<xs:complexType name="name">
  <xs:sequence>
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="first" type="xs:string" />
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
    <xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

This is not very human readable, though it does allow extensibility before, between and after elements. This problem is described in the article as "But that still means that the author has to sprinkle wildcards throughout their types. A type-level any element combined with the aforementioned wildcard changes is needed. One potential solution is that the sequence declaration could have an attribute specifying that extensions be allowed in any place, then a commensurate attributes specifying namespaces, elements, and validation rules."

This solution might look something like

<xs:complexType name="name">
  <xs:sequence extensible="true" namespace="##any" processContents="lax" minOccurs="0" maxOccurs="unbounded">
    <xs:element name="first" type="xs:string" />
    <xs:element name="last" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>

To be precise, we know from the unique particle attribution rule described earlier that the example with the wildcards strewn throughout isn't a valid XML schema syntax, so we use our advocated technique, which yields a schema something like:

<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="Extension" type="ExtensionType" minOccurs="0" maxOccurs="1"/>
    <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="first" type="xs:string" />
    <xs:element name="Extension" type="ExtensionType" minOccurs="0" maxOccurs="1"/>
    <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="last" type="xs:string" minOccurs="0"/>
    <xs:element name="Extension" type="ExtensionType" minOccurs="0" maxOccurs="1"/>
    <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded" />
  </xs:sequence>
</xs:complexType>
<xs:complexType name="ExtensionType">
  <xs:sequence>
    <xs:any processContents="lax" minOccurs="1" maxOccurs="unbounded" namespace="##targetNamespace"/>
  </xs:sequence>
  <xs:anyAttribute/>
</xs:complexType>

Now this is even less readable. And it doesn't even show allowing attributes! However, using Schema and validators as they exist today, this is the only way of achieving backwards and forwards compatible changes with extensibility throughout the type.

Active versus Passive extensibility

The wildcard construct requires that a schema author take action to enable extensibility and versioning. Each author must plan for extensibility and "do something". They are required to be active and take explicit action. Contrast this with the Web. The provision for extensibility in HTML and HTTP headers is part of the protocol specification. Given an HTML or an HTTP stack, an author that wants to extend the language does not have to take any action to enable that extensibility. The extensibility was enabled in the underlying specifications. XML changes this substantially, because it enables arbitrary languages. However, extensibility can still be provided as part of the infrastructure for XML authors. It is intriguing to imagine how passive extensibility could be provided for XML language authors. One approach is for Schema to support a default mode in which types allow extension. Another approach is to provide a different mode of validation, such as Validation by Projection.

Overriding Must Ignore

All of the previous suggestions assume that any content that is unknown must be ignored. The namespace owner can easily ensure that any new content is understood by changing the namespace name or element names. But there is a need in many cases for an extension author to indicate that their extension cannot be safely ignored. They need the equivalent functionality of changing the required namespace name or element name, but they cannot modify the top level elements. This has led to the creation of "mustUnderstand" flags such as the one in SOAP. They override the default Must Ignore rule. Interestingly, we can observe that XML Schema has a processContents attribute which says whether the wildcard requires validation or not. Given the Must Ignore rule, the typical value is "lax". There is another value, which is "strict". One way of specifying that an extension is required to be processed, or at least valid, is to allow for this attribute in the instance. This notion of instance-level Must Understand is very similar to the xsi:type attribute, which allows instance-level typing. An example of a name with a required middle is:

<name> <first>Dave</first> <!-- note introduction of a new schema attribute for instances --> <middle xsi:processContents="strict">Bryce</middle> <last>Orchard</last> </name>
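A receiver honouring such a flag might behave roughly as in the Python sketch below. The xsi:processContents attribute is the hypothetical instance attribute proposed here, not part of XML Schema today; the set of known names and the function name are also assumptions. Note that the sample instance in the sketch declares the xsi prefix, which a parser requires.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
KNOWN = {"first", "last"}   # element names this receiver's schema defines


def unmet_strict_extensions(xml_text, known=KNOWN):
    """Return unknown children that the sender marked as not safely ignorable."""
    root = ET.fromstring(xml_text)
    flag = f"{{{XSI}}}processContents"   # hypothetical instance-level attribute
    return [child.tag for child in root
            if child.tag not in known and child.get(flag) == "strict"]


doc = (
    '<name xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    "<first>Dave</first>"
    '<middle xsi:processContents="strict">Bryce</middle>'
    "<nickname>Dave</nickname>"
    "<last>Orchard</last>"
    "</name>"
)

# An old receiver must reject the message: middle is unknown but marked strict.
# The unflagged nickname extension can still be ignored under Must Ignore.
print(unmet_strict_extensions(doc))
```

A receiver that already knows middle would get an empty list back and could process the instance normally.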

Extensions in the Middle of Types

The example of type extension showed a problem: extensions that are intuitively in the middle of a type get "shoved" to the end. Multiple chains of extensions accentuate this problem. Even in tightly coupled systems, or using type extension that is loosely coupled, it would be very useful to be able to do extension in the middle of types. One idea is to specify an optional schema attribute of "before" that is the name of an element the extension is before. The default could be empty, which means at the end. An example of this:

<xs:complexType name="ExtendedName">
  <xs:complexContent>
    <xs:extension base="tns:name">
      <xs:sequence before="tns:last">
        <xs:element name="middle" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

While extension at the end does not appear burdensome in toy examples like this, there are many industry schemas that have dozens of elements and have multiple extension levels.
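The effect of the proposed "before" attribute on the derived type's element order can be sketched in Python as follows. Content models are reduced to plain lists of element names, and the function name is invented for this illustration.

```python
def extend_sequence(base, extension, before=None):
    """Return the effective element order of the derived type.

    With no anchor the extension is appended, as type extension does today;
    with the hypothetical "before" anchor it is spliced in ahead of that element.
    """
    if before is None:
        return base + extension
    index = base.index(before)   # raises ValueError if the anchor is not in the base type
    return base[:index] + extension + base[index:]


name = ["first", "last"]
print(extend_sequence(name, ["middle"]))                  # today's behaviour
print(extend_sequence(name, ["middle"], before="last"))   # proposed behaviour
```

Chained derivations would apply the same splice repeatedly, so each level could place its additions where they intuitively belong instead of stacking them at the end.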

Backwards compatible extension types

The web is a distributed network space, and therefore there is no centralized control or administrator. Therefore a variety of techniques have been created to allow evolution in nodes in the web without requiring all other nodes to change. The crucial requirement is that one node can change and retain compatibility with another unchanged node. That means that the unchanged node is untouched. This mode of operation has typically meant that type derivation techniques that require receivers of new instances to also have the new type, such as XML Schema's type extension mechanism, are difficult to deploy over the web. Solutions that do not require a "touch" on both sides are more widely used in distributed environments, and XML Schema provides a wildcard for this functionality.

However, it seems like a backwards compatible type derivation could be created. The instance would have to have the "old" type information, as well as the new. We see below an instance of a new type that illustrates this:

<!-- note the xsi:baseType addition--> <improvedName xsi:baseType="tns:nameType"> <first>Dave</first> <last>O</last> <middle>B</middle> </improvedName>

This shows a simple manner for type derivation to be used in a backwards compatible manner. This has the problem that extension must be done at the end of the instance.

Third Party Multiple optional defined elements

Finally, we arrive at a very common use case, that of multiple optional elements, developed independently. The versioning article describes this problem as "there is still the unmet need to define schemas that validate known extensions while retaining extensibility. An author will want to create a schema based upon an extensible schema but mix in other known schemas in particular wildcards while retaining the wildcard extensibility. We encounter this difficulty in areas like describing SOAP header blocks".

Imagine that two extensions, middle name and suffix, have been created by a third party, such that the following document should be valid:

<name> <first>Dave</first> <middle>Bryce</middle> <last>Orchard</last> <suffix>II</suffix> </name>

But the following document should be invalid because the suffix is in the wrong place:

<name> <first>Dave</first> <middle>B</middle> <suffix>II</suffix> <last>O</last> </name>

There is no way of writing schema that extends a wildcard with constrained elements yet allows the extensibility to continue for the remaining unknown elements.

And lest you think this isn't that common a use case, the SOAP specification uses a wildcard for SOAP headers. One cannot provide a schema that constrains SOAP headers yet retains extensibility. WSDL does allow a Web service author to constrain the mandatory SOAP headers by creating a message construct, but it still doesn't allow one to express optional header constraints. Further, there isn't a way to exclude certain headers. A use case is where one version of a header is supported but another isn't, i.e. V2 of a security header is supported and optional but V1 is neither supported nor allowed.

Clearly, this notion of optional elements with extensibility is common, but is as yet unserved in XML Schema. It seems that there needs to be a schema construct that allows extension of an external schema's wildcard (i.e. the schema author can't extend the SOAP specification's namespace) with defined types and a list of disallowed types. I have not yet thought of a solution that enables this to be done in an easy manner.

Extensibility versus Versioning

When I started thinking about extensibility and versioning, I implicitly made an assertion that versioning is a special type of extensibility. This naturally led to the conclusion that the same techniques used for extensibility should be applied, but in a more controlled manner, for versioning. In my mind, the two are along the same axis. Let's step back and think a bit more about these two problems separately before combining them together for a given language.

Starting with versioning, we make a few interesting observations:

Now looking at extensibility, there are some comparable observations:

HTML evolution is a good example of these differences. There were a large number of extensions added by third parties, such as <IMG>, <FORM>, <BLINK>, etc. They were independently developed and did not break backwards compatibility. Eventually many of these were roughly copied into the html namespace name by the HTML working group. The HTML specification went through a linear versioning life-cycle: 1.0, 2.0, 3.2, 4.0, 4.01.

We see that there are a number of differences between extensibility and versioning from a development perspective. Now it is possible that these differences are simply an accident of history, and that new technology such as XML Schema and URIs can combine extensibility and versioning together. The junction point is whether or not a new version can refer to extensions for a new version or whether it has to recreate the extension in its own namespace. Unfortunately, as detailed in the Third Party Multiple Optional Defined Elements section, it's difficult to write a schema so versioning can take advantage of extensions.

Where this leaves us is that we can do some things very well using Schema:

But we are left at a place where we cannot easily combine extensibility and versioning in a distributed environment. Some of the examples of where versioning and extensibility don't completely overlap:

Clearly extensibility and versioning are related, though they appear to have somewhat different requirements. Perhaps unifying them in a single language and set of constructs is possible, but we should keep an open mind to whether this is the right approach.

Survey of solutions

This article has raised a number of issues in designing with Schemas as they exist today, and shown possible solutions. What are the choices that a language designer has today?

It seems that in the short term, the use of an Extension element as proposed meets a variety of the requirements, particularly decentralized extensibility with full validation. And what are the options in the future?

  1. Wait for Schema to be updated, perhaps with some or many of the changes suggested above. As a wildly optimistic guess, Schema 1.1 may be delivered to Recommendation in Summer 2005 (given that the requirements phase is still open). This is followed by probably at least a year before tools are deployed. Then developers would have to switch to the new version of Schema.
  2. Use Validation by Projection. This depends upon whether tooling will support such functionality or not. Conceivably, it could be deployed within a year.

In the longer term, there seems to be a trade-off between layering Validation by Projection (VBP) on top of Schema 1.0 and using Schema 1.1. Waiting means potentially deploying many tightly coupled Web services over a period of potentially many years until Schema 1.1 is widely deployed. That cost seems to argue for keeping Schema as it is and mixing in a new validation model.

This paper has provided a simple compatible evolution use case, examined it using today's technology as well as some potential new technologies, and should provide ample grist for the mill of providing loosely coupled schemas and Web components.