I'll use Adam Bosworth's keynote at XML 2003 to kick off some thoughts around XQuery and the Web. It seems like an interesting world where a service provider could describe their data model and then allow arbitrary queries against it. The current model of both the Web and Web services is that a service provider needs to provide pre-defined operations for each type of query. In fact, this separation of the private, more general query mechanism from the public-facing constrained operations is the essence of the move we made years ago to three-tier architectures. SQL didn't allow us to constrain the queries (subset of the data model, subset of the data, authorization), so we had to create another tier to do this.
What would it take to bring the generic functionality of the first tier (the database) into the second tier? Let's call this "WebXQuery" for now. Or will XQuery stay hidden behind Web and WSDL endpoints?
I first tried to re-use XQuery functionality, rather than providing specific operations, in the SAML spec. My idea was that instead of SAML defining a bunch of operations (getAuthorizationAssertionBySubjectAssertion, getAuthorizationAssertionListBySubjectSubset, ..), SAML would define a Schema data model that could be queried against. A provider would offer a generic operation (evaluateQuery) that took in a query against that data model. That is why I worked on creating a formal domain model in SAML: so it could be queried against.
Now, this was too early in XQuery's life to work. One of the necessary things was to be able to subset XQuery so that only some of the complexity was offered. The security model was handled outside the scope of the individual query, but that would need to be worked in.
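To make the contrast between specific operations and a generic evaluateQuery operation concrete, here is a minimal sketch. It rests on several assumptions: the element and attribute names are invented for illustration and are not taken from the SAML schema, the function names are hypothetical, and Python's standard library has no XQuery engine, so the limited XPath subset in `xml.etree.ElementTree` stands in for the query language.

```python
import xml.etree.ElementTree as ET

# Hypothetical assertion store. Element and attribute names are illustrative
# only; they are NOT the real SAML schema.
ASSERTIONS = ET.fromstring("""
<assertions>
  <assertion subject="alice" resource="/reports" decision="Permit"/>
  <assertion subject="alice" resource="/admin" decision="Deny"/>
  <assertion subject="bob" resource="/reports" decision="Permit"/>
</assertions>
""")


def get_authorization_assertions_by_subject(subject):
    """Specific operation: the provider predefines exactly one kind of query."""
    return [a for a in ASSERTIONS.findall("assertion")
            if a.get("subject") == subject]


def evaluate_query(path):
    """Generic operation: the client supplies the query against the data model.

    ElementTree's XPath subset is a stand-in for XQuery (an assumption).
    """
    return ASSERTIONS.findall(path)


# The specific operation answers only the question it was written for.
alice_all = get_authorization_assertions_by_subject("alice")

# The generic operation lets the client combine conditions the provider
# never anticipated, without a new operation being defined.
alice_permits = evaluate_query(
    "./assertion[@subject='alice'][@decision='Permit']"
)
```

With the specific interface, a client wanting only Alice's permits would have to fetch all of her assertions and filter locally; with the generic one, the filtering moves into the query.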
Obviously the choice between a generic XQuery interface and a specific operational interface depends upon the application, and they probably need to be matched. Specific interfaces are useful in many different conditions, but they don't work very well if the client really needs a generic interface. The idea is that there is currently an impedance mismatch in some applications, particularly where a client needs a generic interface but a specific interface is all that is available. The client ends up invoking large numbers of operations and then transforming the retrieved data models into its own data model. This leads to brittle and complex clients, and to providers that can't scale to client demand in functionality and performance.
If this idea of providing both generic and specific query interfaces to applications is interesting, what technology is necessary? I've listed a number of areas that I think need examination before we can marry XQuery to the Web and make a generic second tier.
1. How to express that a particular schema is queryable and the related bindings and endpoint references to send and receive the queries. Some WSDL extensions would probably do the trick.
2. Limit the data set returned in a query. There's simply no way a large provider of data is going to let users retrieve the entire data set from a query. Amazon is just not going to let "select * from *" happen. Perhaps formal support in XQuery for ResultSets, layered on any query result, would do the trick. A client would then need to iterate over the result set to get all the results, and so a provider could more easily limit the number of iterations. Another mechanism is to constrain the Return portion of XQuery. Amazon might specify that only book descriptions with reviews are returnable.
3. Subset the XQuery functionality. XQuery is a very large and complicated specification. There's no need for all that functionality in every application. Subsetting would also make implementations of XQuery more widespread. Probably the biggest split will be Read versus Update.
4. Data model subsets. Particular user subsets will only be granted access to a subset of the data model. For example, Amazon may want to say that book publishers can query all the reviews and sales statistics for their books but users can only query the reviews. Maybe completely separate schemas for each subset. The current approach seems to be to do an extract of the data according to each user subset, so there's a data model for publishers and a data model for users. Maybe this will do for WebXQuery.
5. Security. How to express in the service description (WSDL or policy?) that a given class of users can perform some subset of the functionality, whether of the query, the data model or the data set. We need some way of specifying the relationship between the data model, query functionality, data set and authorization.
6. Performance. The Web has a great ability to increase performance because resources are cachable. The design of URIs and HTTP specifically optimizes for this. The ability to compare URIs is crucial for caching, which is why so much work went into specifying how they are absolutized and canonically compared. But clearly XQuery inputs are not going to be sent in URIs, so how do we have cachable XQueries given that the query will be in a SOAP header? There is a well-defined place in URIs for the query, but there isn't such a thing in SOAP. There needs to be some way of canonicalizing an XQuery and knowing which portions of the message contain the query. Canonicalizing a query through c14n might do the trick, though I wonder about performance. And then there's figuring out which header has the query. There are two obvious solutions: provide a description annotation or an inline marker. I don't think that requiring any "XQuery cache" engine to parse the WSDL for all the possible services is really going to scale, so I'm figuring a well-defined SOAP header is the way to go.
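Item 2 in the list above, a result set that the client must iterate and that the provider can cap, can be sketched roughly as follows. This is only an illustration under assumptions: the `ResultSet` class, its `page_size` and `max_pages` parameters, and the sample data are all hypothetical, and nothing here is part of XQuery itself.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSet:
    """Hypothetical provider-side cursor layered over a query result."""
    items: list
    page_size: int = 2   # how many results one iteration returns
    max_pages: int = 2   # provider-imposed cap on the number of iterations
    _pages_served: int = field(default=0, init=False)

    def next_page(self):
        """Return the next page, or refuse once the provider's cap is hit."""
        if self._pages_served >= self.max_pages:
            raise PermissionError("iteration limit reached for this query")
        start = self._pages_served * self.page_size
        self._pages_served += 1
        return self.items[start:start + self.page_size]


# Pretend this is the full (large) result of an arbitrary client query.
books = [f"book-{i}" for i in range(10)]

rs = ResultSet(books)
first = rs.next_page()    # first page of results
second = rs.next_page()   # second page; the cap is now reached
```

The query itself stays unconstrained; the limit is applied to how much of its result a client can walk through, which is the kind of control a provider could tune per class of user.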
Your thoughts? Is WebXQuery an interesting idea and what are the hurdles to overcome?
Dave,
first of all a general comment:
I believe REST could be used for WebXQuery. According to R. L. Costello's "Building Web Services the REST Way":
"Categorize your resources according to whether clients can just receive a representation of the resource, or whether clients can modify (add to) the resource. For the former, make those resources accessible using an HTTP GET. For the later, make those resources accessible using HTTP POST, PUT, and/or DELETE."
So, an HTTP POST would be ok because the (returned) resource would look modified by the query. The XQuery statement would be posted to the URI resource.
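As a rough sketch of what posting a query to a resource might look like, the snippet below builds (but does not send) such a request. The endpoint URL, the query text, and the choice of the `application/xquery` media type are all assumptions made for illustration.

```python
import urllib.request

# Hypothetical endpoint and query; neither comes from the post above.
ENDPOINT = "http://example.com/books/query"
QUERY = "for $b in //book where $b/review return $b/description"


def build_query_request(endpoint, query):
    """Build an HTTP POST whose body is the XQuery text. Nothing is sent."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        # Media type chosen as an assumption for this sketch.
        headers={"Content-Type": "application/xquery"},
        method="POST",
    )


request = build_query_request(ENDPOINT, QUERY)
```

Passing the result to `urllib.request.urlopen` would actually send it; that step is left out here because the endpoint does not exist.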
Further comment:
In terms of performance and caching, I believe canonicalizing a query through c14n may not be sufficient, because semantically identical queries may differ only by a pair of brackets. I believe it would be more beneficial to standardize the format and syntax of an XQuery algebra and to use that for WebXQuery (only the algebra notation would be cached). Security, data model subsets and size limitations would all benefit from a standardized XQuery algebra model.
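The bracket problem can be illustrated with a small sketch. One loud assumption: Python's standard library has no XQuery parser, so Python expressions and the `ast` module stand in for queries and for an "algebra" form. The function names and sample expressions are hypothetical.

```python
import ast
import hashlib


def textual_key(query):
    """Cache key from the query text, with only whitespace normalized."""
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def structural_key(query):
    """Cache key from the parsed expression tree.

    Redundant parentheses do not appear in the tree, so they cannot
    change the key. Python's parser is a stand-in for an XQuery algebra.
    """
    tree = ast.parse(query, mode="eval")
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


# Two spellings of the same condition, differing only by a pair of brackets.
q1 = "price * quantity > 100"
q2 = "(price * quantity) > 100"

same_text = textual_key(q1) == textual_key(q2)
same_structure = structural_key(q1) == structural_key(q2)
```

A key computed over text treats the two spellings as different cache entries, while a key computed over the parsed structure treats them as one, which is the argument for caching on an algebra notation.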
thanks for your site and keep it up.
Good post, Dave. Lots to think about here.