The Design of XQL

Jonathan Robie, Texcel Research (jonathan@texcel.no)

The XML Query Language proposal (XQL), jointly developed by Texcel, webMethods, and Microsoft, proposes a query language for XML documents. This document complements the proposal by describing XQL's underlying design and the motivation for that design.

Why Query XML?
What is an XML Query?
Search Contexts and Results
Evaluating an XQL Expression
Completing the XQL Model
References
1. Bibliography
2. Sample XML and SGML Documents

1. Why Query XML?

Traditionally, structured queries have been used primarily for relational or object oriented databases, and documents were queried with relatively unstructured full-text queries. Although quite sophisticated query engines for structured documents have existed for some time, they have not been a mainstream application.

XML documents are structured documents – they blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents that contain very little structure beyond the use of headers and tables. Still others are somewhere in between, e.g. reference works like dictionaries or technical manuals, documents in which “looking something up” is a long-standing tradition that predates computers. Yet other kinds of documents, not commonly entered as structured documents, become incredibly useful as sources of data when properly encoded in XML; for instance, a patient record encoded in XML can become a rich data source for queries about medical history, diagnoses, treatments, and billing information (cf. the HL7 Kona Proposal, “http://www.mcis.duke.edu:80/standards/HL7/sigs/sgml/WhitePapers/KONA/”).

As more and more information is either stored in XML, exchanged in XML, or presented as XML through various interfaces, the ability to intelligently query our XML data sources becomes increasingly important. For instance, consider the applications mentioned in Jon Bosak’s seminal paper “XML, Java, and the future of the Web” (http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html). When XML is used as a universal interchange format, it is often desirable to also have a universal query language for requesting relevant data. When applets use Java to persist or parse data, it is helpful to allow them to query for the data they need. When multiple views of document data are desired, a query language is an ideal means of specifying these views. Intelligent agents using XML for data discovery are much more powerful if they can discover and query their data sources. In short, most of the applications to which XML is particularly well suited are enhanced by the availability of a suitable query language.

1.1. Problem domains for XML Queries

XQL was designed as a general-purpose query language for XML. During the design of the language, four problem domains were chosen to determine the requirements:

Queries within a single document (e.g. in a browser or editor)
Queries in collections of documents (e.g. document assembly in an XML repository)
Addressing within or across documents (cf. XLL, XPointers)
XSL Patterns

These are actually quite distinct problem domains, and not all of these are queries in the traditional sense of the term – XPointers identify known locations in specific documents, XSL patterns are often evaluated for specific elements that are being processed in order to determine whether a template applies, and queries determine whether data that meets specific conditions is present, returning the relevant data if found. Because of the notable differences in the tasks, the SGML and XML communities have traditionally treated these as completely separate domains.

There is, however, a great deal of commonality among these tasks, especially with respect to their need to specify XML structures. The processing that is done may be quite different in the different domains; still, the needs of each are quite similar on the query-language level. In some ways this is analogous to the differing uses of SQL in traditional relational databases. SQL may be used in a variety of ways that involve different forms of processing, e.g. to establish views, execute queries, or set cursors, but the query language used to perform these various tasks is the same.

Let’s consider some of these problem domains to determine how a query language might be used, the requirements for such a query language, and the ramifications each has for the design of a multi-purpose query language.

Addressing within or across documents is useful for referencing known locations in documents using hyperlinks. Anyone who has used a web browser is familiar with the links provided by HTML, which allow documents to be addressed by URL, and allow anchors to be placed within a document to enable direct addressing to specific named locations within that document, where the names are defined in attributes. In general, a link resolves to a single address in a document, and selecting a link navigates to that address. A single web document often contains many links to other documents, or to specific sections within the same document. In the XML and SGML worlds, several more sophisticated varieties of links have existed for a long time, including TEI Links, HyTime Links, and XML Linking. These all allow greater flexibility in the kinds of links that may be specified, including:

Addressing any node in a document, with no need to modify a node to make it addressable.
Specifying relative paths from a known location.
Specifying absolute paths in a document tree.

Individual nodes may be addressed by any of these criteria:

The name of any element or attribute
The type of a node (element, attribute, processing instruction, comment, etc.)
The content or the value of any node
Relationships among nodes identified by the above criteria, including hierarchy, sequence, and position.

It is worth noting that the languages that have been designed for this kind of sophisticated linking are non-trivial, and the reason for this is simply that the relationships found in documents can be fairly difficult to describe well.

Queries within a single document are useful in XML browsers or editors to allow the user to query large documents and find relevant information without scrolling through the entire document. They may be used in scripting languages to provide powerful non-procedural access to document data and structures. In addition, they may be used by document authors to define various views of a document, e.g. for users with varying backgrounds, or users with differing access rights. The input to such a query is an entire document, which may be of any size. A sophisticated implementation may have indexes defined for the document, and may have ways to avoid loading the entire document into memory; a relatively naïve implementation may simply search the entire document. The result of such a query may differ among implementations; e.g., the result may be an iterator that traverses precisely those nodes returned by the query, in document order, or it may be a set of addresses that may be used to jump to the appropriate position in the original document, or it may be a navigation object that allows the nodes returned by the query to be traversed as a virtual tree, i.e. a view containing a subset of the original document.

Although the manner in which results are retrieved and returned may differ considerably, the role of the query language is the same in all of these cases: it must describe the set of nodes that should be returned. As with addressing, nodes may be identified by any of these criteria:

The name of any element or attribute, or the target of a processing instruction
The type of a node (element, attribute, processing instruction, comment, etc.)
The content or the value of any node
Relationships among nodes identified by the above criteria, including hierarchy, sequence, and position.

When the results of a query are returned as an iterator or a view, it is important for the results to be capable of being returned in document order so that the results can be intelligible as a document.

Queries in collections of documents are useful in a wide variety of settings, including document assembly using XML repositories, queries performed on a single web site or across web sites, and data mining. The input to such a query is a set of documents or a set of nodes within multiple documents. The range of possible outputs is the same as we have described for single documents above, except that the output from multiple documents must be represented. The addressing requirements are also the same, except for the additional need to be able to address the individual documents.

XSL Patterns are used to specify tree-to-tree transformations on documents. One phase of these transformations involves testing individual nodes based on their properties and the properties of related nodes. The output of an XSL transformation is a tree, transformed using the rules specified in the XSL templates. The set of tests applied to a node by XSL patterns is the same as the set of criteria we have mentioned for queries:

The name of any element or attribute, or the target of a processing instruction
The type of a node (element, attribute, processing instruction, comment, etc.)
The content or the value of any node
Relationships among nodes identified by the above criteria, including hierarchy, sequence, and position.

1.2. The role of a query language

Our brief examination of various problem domains shows that they vary widely in terms of their input, output, processing model, and the desired form in which results are returned. However, they all have one thing in common: they need to be able to specify one or more nodes based on assertions about their names, content, values, and relationships to other nodes (whose names, content, and values may also be specified). XQL is a means of specifying these assertions. Since XQL says nothing about the manner in which input is provided, the format of the output, or the processing model used to apply these assertions, it is completely independent of the factors which distinguish these problem domains.

At first blush, it might seem reasonable to design a separate query language for each problem domain. This is probably not the best approach. A query language for XML must make it easy to express all of the assertions that might commonly be made about nodes found in an XML document and their relationships to other nodes. It is difficult to design a good query language, and history shows that many special-purpose query languages have not stood the test of time. A sufficiently general query language is useful for many different problems, and continues to serve well if the original problem grows beyond the original scope. Moreover, a serious web developer typically needs to work with a wide variety of standards, and if each standard has its own query language, this significantly increases the number of things that developers need to learn. However, it is also important to note that different subsets of a query language may be required by different applications, and applications may differ in the sophistication of their implementations.

Simple applications that use a query language are largely concerned with speed of implementation. For these applications, it is important that the query language be easily parsed, that the semantics be unambiguous and easily implemented, and that acceptable performance can be achieved by a naïve implementation.

Applications that must provide good performance for large document sets have very different concerns. Although the designers of these applications are also grateful for an easily parsed language with clear semantics, their main concern may well be query optimization, the ability to design efficient indexes to support queries, and executing queries without loading each entire document into memory.

We have attempted to design XQL in a manner that takes the concerns of these various problem domains into account. Naturally, it may not be possible to develop one query language that is suitable to all problem domains, and there may be good reasons for special-purpose query languages; furthermore, much more interaction with professionals in various problem domains is needed in order to determine whether XQL adequately meets their needs, could meet their needs with appropriate modification, or is simply the wrong solution for a particular domain. However, we are optimistic that XQL will be found useful in a variety of problem domains.

1.3. String representation issues

Initially, the requirements for XQL were functional, but when we considered the range of problem domains in which it might be used, we found that these domains impose constraints on the string representation of queries. To be a good citizen of the web, a query language should be usable in a variety of environments: in programming language and scripting language strings, in URLs, and as attributes in documents or XSL templates.

To understand the importance of string representation issues, let’s explore some of the environments in which an XQL query might be used. XQL queries can easily be typed as strings on a command line, generated by graphical query interfaces, or embedded as strings in programs. For instance, here is an XQL query that returns author elements that are children of front elements:

front/author

In programming languages, XQL queries may be assigned to strings and executed as queries. Here is the same query expressed as a Java String:

String qstring = "front/author";

Because queries may be typed or read manually, they should be compact and readable.

XQL queries may be used as part of URLs to provide fine-grained addressing within documents. Here is the same query used as part of a URL:

http://www.example.com/docs#front/author

Since spaces generally may not be used in URLs, they are optional in all XQL queries. Many characters are not allowed in URLs, so XQL has a limited character set. Note: We have received conflicting information regarding some questions of character usage in URLs. There are questions that need to be resolved with respect to this issue.

XQL queries may also be embedded in attributes of HTML or XML documents. Here is the same query in an HREF attribute:

<a href="http://www.example.com/docs#front/author">

Attributes impose further limitations on the characters that may be used, ruling out “<” and “&” as characters.
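The two embeddings above can be illustrated with a small Python sketch. It is only an illustration: the helper names `as_url` and `as_href` are hypothetical, and the percent-encoding and attribute-escaping rules are those of Python's standard library, not rules defined by XQL.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr

BASE_URL = "http://www.example.com/docs"  # the example URL used in the text


def as_url(query: str) -> str:
    """Append a query to BASE_URL as a fragment, percent-encoding unsafe characters."""
    return BASE_URL + "#" + quote(query, safe="/")


def as_href(query: str) -> str:
    """Build an <a> start-tag carrying the query; quoteattr escapes '&' and '<'."""
    return "<a href=" + quoteattr(as_url(query)) + ">"


print(as_url("front/author"))
print(as_href("front/author"))
# A query typed with spaces needs encoding, which is why optional spaces help.
print(as_url("front / author"))
```

A query written without spaces passes through unchanged, which is the practical benefit of keeping whitespace optional and the character set small.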

2. What is an XML Query?

In SGML and XML circles, there has been a great deal of discussion about exactly what constitutes a query for structured documents. The question initially strikes most as simple, but there are a variety of possible answers to this question, and when people ask whether systems such as XPointers or XSL patterns provide queries, the question is impossible to answer without first defining what is meant by a query.

There are actually a number of legitimate ways to define queries for documents, and the word query has been used in a variety of ways in computer science, so the way we define queries in this section is not normative for all systems that implement queries for XML. Our purpose here is merely to explain what we mean by a query in the context of XQL, and to present a simple model, which will serve as a framework for the rest of this document.

2.1. Queries, search contexts, and result sets

To examine the characteristics of an XML query, it is useful to consider four basic questions about the environment in which a query takes place:

What is a database?
What is the query language?
What is the input to a query?
What is the result of a query?

The following table provides a brief answer to each of these questions, including a comparison with the SQL query language, which is widely used for querying relational databases:

SQL	XQL
The database is a set of tables.	The database is a set of one or more XML documents.
Queries are done in SQL, a query language that uses tables as a basic model.	Queries are done in XQL, a query language that uses the structure of XML as a basic model.
The FROM clause determines the tables which are examined by the query.	A query is given a set of input nodes from one or more documents, and examines those nodes and their descendants.
The result of a query is a table containing a set of rows.	The result of a query is a set of XML document nodes, which can be wrapped in a root node to create a well-formed XML document.

To illustrate these concepts more concretely, let’s look at a relatively simple XQL query, examining the input to the query, the query itself, and the result. In this example, the input to the query (known as the “search context”) is a <novel> element, which is the root of a document:

Search Context:

<novel>
  <front>
    <title>The Heart of Darkness</title>
    <author>Joseph Conrad</author>
  </front>
</novel>

In XQL, the simplest possible query is an unadorned string, which represents an element name. Thus, “novel” is a full query, and asks for all <novel> elements from the current search context:

Query:

novel

The result set of this query is the set of all <novel> elements in the search context. For our example, since there is only one <novel> element, the result set is equivalent to the search context:

Result Set:

<novel>
  <front>
    <title>The Heart of Darkness</title>
    <author>Joseph Conrad</author>
  </front>
</novel>

Note that both the search context and the result set for this example contain one node each. We have shown the children of the <novel> element in both the search context and the result set because environments that return XQL results as ASCII would return the children as well.
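As a rough illustration of this model, the following Python sketch evaluates a bare element name against a search context. It assumes that the search context is represented as a list of `xml.etree.ElementTree` elements; the function name `name_query` is hypothetical and not part of XQL.

```python
import xml.etree.ElementTree as ET

NOVEL = """<novel>
  <front>
    <title>The Heart of Darkness</title>
    <author>Joseph Conrad</author>
  </front>
</novel>"""


def name_query(search_context, name):
    """Return the elements in the search context whose element name is `name`."""
    return [node for node in search_context if node.tag == name]


root = ET.fromstring(NOVEL)
search_context = [root]  # the search context holds one node: <novel>
result_set = name_query(search_context, "novel")
print([node.tag for node in result_set])
```

The result set holds the same single <novel> node as the search context; its children come along only because they are part of that node.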

2.2. XML as a data model

An important motivation for the design of XQL is the realization that XML has its own implied data model, which is neither that of traditional relational databases nor that of object oriented or object-relational databases. The first step toward designing a useful query language for XML documents is an adequate understanding of their implicit structure.

Each node in an XML document has a type and either content or a value. In addition, elements and attributes have names, and may have prefixes or namespaces associated with these names. Conditions for a node may refer to any of these properties.

It is also important to note that the relationships among data contain a large proportion of the information contained in a document, which is one of the reasons that structured document formats like XML are useful in the first place. We believe that the following relationships, which we explain in this section, are fundamental to the semantics of XML documents:

Hierarchy
- parent/child
- ancestor/descendant
Sequence (within a sibling list or in document order)
- immediately precedes
- precedes
Position (within a sibling list or in document order)
- absolute
- relative
- ranges

These relationships form the basic logical structure of XQL. They are also central to several other systems designed for addressing into SGML documents, including TEI Pointers, which have been used in academic circles for many years to index into documents, and XPointers, an addressing language being developed as part of the W3C XML activity. We see this as confirmation of the fundamental importance of these relationships in XML documents.

In addition, the hierarchy and sequence portions correlate precisely with the basic structural relationships of Transformational Grammar. For instance, Radford lists the basic relationships of Transformational Grammar as “dominates”, “immediately dominates”, “immediately precedes”, and “precedes”; these are equivalent to our “ancestor/descendant”, “parent/child”, “immediately precedes”, and “precedes” relationships, respectively. Because XQL has the same relationships, we can assume that it also has similar expressive power to Transformational Grammar for the relationships it expresses.

However, it is important to note that sequence, while important for documents, is not important in many applications that use XML mainly to persist data from objects, relational databases, and other data sources. Therefore, some XQL implementations may choose not to support the “immediately precedes” and “precedes” relationships.
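To make the hierarchy and sequence relationships described in this section concrete, here is a minimal Python sketch over `xml.etree.ElementTree` elements. The function names are hypothetical, and sequence is tested only within a sibling list, which is one of the two readings mentioned above.

```python
import xml.etree.ElementTree as ET


def is_child(parent, node):
    """parent/child: node appears directly in parent's child list."""
    return any(child is node for child in parent)


def is_descendant(ancestor, node):
    """ancestor/descendant: node appears anywhere below ancestor."""
    return any(d is node for d in ancestor.iter() if d is not ancestor)


def precedes(parent, a, b):
    """precedes: a comes somewhere before b among parent's children."""
    siblings = list(parent)
    return siblings.index(a) < siblings.index(b)


def immediately_precedes(parent, a, b):
    """immediately precedes: b is the sibling right after a."""
    siblings = list(parent)
    return siblings.index(a) + 1 == siblings.index(b)


novel = ET.fromstring(
    "<novel><front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front></novel>"
)
front = novel[0]
title, author = front[0], front[1]
print(is_child(front, title), is_descendant(novel, title))
print(immediately_precedes(front, title, author), precedes(front, author, title))
```

An implementation that chooses not to support sequence would simply omit the last two tests.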

2.3. Result sets vs. result documents

In many environments it is useful for the results of a query to be presented as well-formed XML documents. Some reasons for this include:

An XML document is easily parsed with a standard XML parser, so it can be transmitted as a single ASCII stream and parsed by the receiving application.
An XML document can be displayed in a standard XML browser.
An XML document can be stored in an XML repository.
An XML document can be passed on to an XSL processor to perform transformations or do formatting.

In the example we showed above, the result set contains only one node. Whenever a query returns more than one node, though, a text representation of the result set is not a well-formed XML document, because an XML document can have only one root node.

To illustrate this problem, let’s do a slightly more complicated query. A wildcard (‘*’) matches any element, regardless of element name, and the parent/child operator (‘/’) indicates a parent/child relationship, so the following query searches for all children of <front> elements which are children of <novel> elements:

Query:

novel/front/*

The result set of this query contains two nodes:

Result Set:

<title>The Heart of Darkness</title>
<author>Joseph Conrad</author>

Because this result set contains two nodes, it is not a well-formed XML document. However, if we wrap the nodes of this set in a common root element, we then have a well-formed XML document. Therefore, the “result document” of an XQL query always wraps the nodes of the result set in an <xql:result> element:

Result Document:

<xql:result>
  <title>The Heart of Darkness</title>
  <author>Joseph Conrad</author>
</xql:result>
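The wrapping step can be sketched in Python as follows. This is an illustration under assumptions: the helper `result_document` is hypothetical, the query novel/front/* is hand-coded as a list comprehension rather than parsed, and the xql prefix is left undeclared, as it is in the examples in this paper.

```python
import xml.etree.ElementTree as ET

NOVEL = (
    "<novel><front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front></novel>"
)


def result_document(result_set):
    """Wrap the nodes of a result set in a single <xql:result> root element."""
    wrapper = ET.Element("xql:result")
    wrapper.extend(result_set)
    return ET.tostring(wrapper, encoding="unicode")


search_context = [ET.fromstring(NOVEL)]
# Hand-coded equivalent of novel/front/* for this sketch.
result_set = [
    child
    for novel in search_context if novel.tag == "novel"
    for front in novel if front.tag == "front"
    for child in front
]
print(result_document(result_set))
```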

3. Search Contexts and Results

In this paper, most XQL constructs will be explained briefly, then illustrated with an example like this:

Search Context:

<title>The Heart of Darkness</title>
<author>Joseph Conrad</author>

Query:

title

Result:

<xql:result>
  <title>The Heart of Darkness</title>
</xql:result>

The search context is the set of XML document nodes for which the query is evaluated. A query engine executes an XQL query for a given search context, which may be the set of nodes at the root of a document, or any other set of nodes made available to the query environment.

The result set is the set of nodes selected by the query. Since many environments will want to process the result set further, e.g. by transforming or formatting it with XSL, XQL represents results as a well-formed XML document called the result of the query. Since XML documents may have only one root node, the result of a query wraps the result set in an <xql:result> element; this is the only difference between the result and the result set. There are some query expressions that return text that cannot be converted to a well-formed document simply by placing the result set in an element wrapper; e.g., when attributes are returned without the element that contains them.

XQL specifies nothing about the physical representation of the result. The result may be represented in a variety of ways, e.g. as ASCII text, as Document Object Model nodes, as a set of hypertext links, by setting an internal cursor, or by executing a procedure to process the selected nodes. In this document, we will represent both the search context and the query result as ASCII text.

Selecting a node is defined as placing it in the result set. In result sets, XQL preserves the sequence and hierarchy of document nodes, maintaining the same sequence and hierarchy found in the original document. All nodes are unique - a node may appear only once in the result set. Nodes from a given document are returned in document order, which is defined as the order of the start-tags of the elements in an ASCII representation of the document.
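The uniqueness and ordering rules can be sketched in Python as shown below. The helper `in_document_order` is hypothetical; it assumes all nodes come from one document held as an `xml.etree.ElementTree` tree, whose `iter()` visits elements in start-tag order.

```python
import xml.etree.ElementTree as ET


def in_document_order(root, nodes):
    """Drop duplicate nodes and sort the rest by the position of their start-tags."""
    position = {node: i for i, node in enumerate(root.iter())}
    unique = dict.fromkeys(nodes)  # keeps one entry per node object
    return sorted(unique, key=position.__getitem__)


root = ET.fromstring(
    "<novel><front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front></novel>"
)
front = root[0]
title, author = front[0], front[1]

# The same node found twice, and out of order, collapses to one ordered result set.
ordered = in_document_order(root, [author, title, author])
print([node.tag for node in ordered])
```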

4. Evaluating an XQL Expression

Much of this paper is concerned with the way queries or query expressions are evaluated for a given search context. Consider the example shown earlier:

Search Context:

<title>The Heart of Darkness</title>
<author>Joseph Conrad</author>

Query:

title

This query evaluates the term “title” for the search context. The term evaluates to the set of <title> elements in its search context. When a query has operators, evaluation becomes somewhat more complex. When an operator is evaluated for a given search context, it selects the appropriate search context for each of its operands. The search context for which an operand is evaluated need not be identical to the search context of the query. For instance, the following query searches for <title> elements inside <front> elements:

Search Context:

<front>
  <title>The Heart of Darkness</title>
  <author>Joseph Conrad</author>
</front>

Query:

front/title

The search context for the child operator ‘/’ is the same as the search context for the query. The child operator evaluates its left-hand operand, “front”, in this search context, obtaining the set of <front> elements. For each <front> element in the set, it establishes a new search context, which consists of the children of that <front> element, and evaluates the right-hand operand, “title”, for that new search context. Taken as a whole, the query evaluates to the set of <title> elements that have <front> elements in the current search context as parents.
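The evaluation just described can be written down as a short Python sketch. It follows the abstract steps literally and is not meant as an efficient strategy; the names `term` and `child` are hypothetical, and a search context is assumed to be a list of `xml.etree.ElementTree` elements.

```python
import xml.etree.ElementTree as ET


def term(name):
    """A term evaluates to the elements named `name` in its search context."""
    def evaluate(search_context):
        return [node for node in search_context if node.tag == name]
    return evaluate


def child(left, right):
    """The child operator: evaluate `right` in the children of each `left` match."""
    def evaluate(search_context):
        result = []
        for parent in left(search_context):
            new_context = list(parent)  # the children form the new search context
            result.extend(right(new_context))
        return result
    return evaluate


front = ET.fromstring(
    "<front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front>"
)
front_title = child(term("front"), term("title"))  # front/title
titles = front_title([front])
print([node.text for node in titles])
```

Because `child` takes and returns evaluators, longer paths can be built by nesting, e.g. `child(child(term("novel"), term("front")), term("title"))`.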

For now, it is not important to understand exactly how the child operator is evaluated, but it is important to note that the manner in which an operator or a term is evaluated for a given search context conveys the precise meaning of that operator or term. It is also important to note that the evaluation of an operator or term is meant as an abstract description of what should be evaluated, not how it should be evaluated. In many cases, an implementation that woodenly follows the steps of the evaluation shown here will be quite inefficient, and the optimal strategy depends greatly on the representation of the document, available indexes, and other implementation-specific issues.

Most XQL expressions will evaluate either to a set of nodes or a Boolean value (true or false). All XQL query expressions may be said to evaluate to true or false. An expression evaluates to true if it is one of the following:

A true Boolean value
A set of Booleans containing at least one true value
A non-empty set
A single node (which evaluates to a non-empty set containing that node, and therefore to true)

An expression evaluates to false if it is one of the following:

A false Boolean value
A set of Boolean values containing no true values
An empty set
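These rules can be summarized in a small Python sketch. The function `is_true` is hypothetical; it assumes that node sets and sets of Booleans are passed as lists and that a single node is an `xml.etree.ElementTree` element.

```python
import xml.etree.ElementTree as ET


def is_true(value):
    """Apply the truth rules listed above to the value of an expression."""
    if isinstance(value, bool):
        return value                      # a Boolean is its own truth value
    if isinstance(value, ET.Element):
        return True                       # a single node is a non-empty set
    items = list(value)
    if items and all(isinstance(item, bool) for item in items):
        return any(items)                 # a set of Booleans needs one true value
    return len(items) > 0                 # a node set is true when non-empty


node = ET.Element("title")
print(is_true(True), is_true([False, True]), is_true([node]), is_true(node))
print(is_true(False), is_true([False, False]), is_true([]))
```

The explicit `isinstance` check for a single node matters in this sketch because an element without children is itself an empty sequence in ElementTree.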

5. Completing the XQL Model

This section introduces return operators and sequence, which are basic to the complete XQL model, but not necessary for all XQL implementations(2). Return operators are analogous to the SELECT statement in SQL, and allow much better control over what is returned from a query. However, they are not necessary for all applications, since many applications generally return single nodes from queries or have other very simple requirements for what is returned. Sequence allows the order in which data appears in a document to be used in query conditions, and is extremely helpful for many kinds of document data. However, many applications are not concerned about sequence. In a relational database, the sequence of rows or columns is insignificant, and relational theory explicitly states that these sequences may have no hidden meaning. In objects, the sequence in which the attributes of an object are declared has no meaning. Therefore, systems that deal primarily with these kinds of data generally do not care about sequence.

In the complete XQL model, conditions for individual nodes may include:

Conditions on element or attribute names (e.g. “author”, “@id”)
Conditions on content or values (e.g. “author = ‘Theodore Seuss Geisel’”, “@id = ‘id5001’”)
Conditions on node type (e.g. “element()”, “pi()”)

As we have mentioned in a previous section, the basic relationships among nodes are:

Hierarchy
- parent/child (“/”)
- ancestor/descendant (“//”)
Sequence
- immediately precedes (“;”)
- precedes (“;;”)
Position (“[ ]”)
- absolute
- relative
- ranges

Conditions for nodes and conditions for the relationships among nodes are combined to form path expressions. A query searches for paths within the search context that match the path expression. Return operators are used to select specific nodes from matching paths so that they will be returned from the query. They are analogous to the SELECT statement in SQL.

There are two kinds of return operators:

Shallow returns (“?”) return one node only.
Deep returns (“??”) return a node and all of its children.

The following table shows the precedence among all of the operators in the full XQL query model:

Query Operators by Decreasing Precedence

Grouping	()
Filter	[]
Return	? ??
Path	/ //
Comparison	= != < <= > >= $eq$ $ne$ $lt$ $le$ $gt$ $ge$ $ieq$ $ine$ $ilt$ $ile$ $igt$ $ige$
Intersection	$intersect$
Union	$union$ |
Negation	$not$
Conjunction	$and$
Disjunction	$or$
Sequence	; ;;

5.1. Selecting Nodes for the Result Set

In queries that do not use return operators, the result set of the query is the same as the set of nodes to which the query evaluates. However, XQL’s return operators allow nodes to be placed in the result set explicitly. Consider the following query:

front/title

This query evaluates to a set of <title> elements. By placing a return operator (‘??’) after “front”, we can modify this so that it selects <front> elements instead of <title> elements:

front??/title

Now it is time to draw an important, but subtle, distinction. The nodes to which a query evaluates are not necessarily the same as the nodes that the query selects. Evaluation has to do with the manner in which the operators and terms are computed with respect to a search context. Selecting a node means to place it in the result set. The expressions “front/title” and “front??/title” both evaluate to a set of <title> elements, but the first selects the set of <title> elements, and the second selects a set of <front> elements.

There are two ways for a node to be selected:

If no return operators are present, the set of nodes to which the query evaluates is selected. XQL-Patterns does not allow return operators, and always selects the set of nodes to which the query evaluates.
Return operators always select the objects to which they apply, provided they appear in an expression that evaluates to true. If at least one deep return operator is present in a query, the query selects only the objects for which there are return operators, and does not select the set of nodes to which the query evaluates. Therefore, when return operators are used, the set of nodes to which a query evaluates and the set of nodes that a query returns need not be identical, and often are not.
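The distinction between evaluating and selecting can be sketched in Python for the two expressions discussed above. This is one possible reading of the rules, hand-coded for “front/title” and “front??/title” only; the function names are hypothetical, and a deep return is modeled by returning the element object together with its subtree.

```python
import xml.etree.ElementTree as ET


def evaluate_front_title(search_context):
    """Both expressions evaluate to the <title> children of <front> elements."""
    return [
        title
        for front in search_context if front.tag == "front"
        for title in front if title.tag == "title"
    ]


def select_front_title(search_context):
    """front/title has no return operator, so it selects what it evaluates to."""
    return evaluate_front_title(search_context)


def select_front_deep_title(search_context):
    """front??/title selects each <front> for which front/title is non-empty."""
    return [
        front
        for front in search_context
        if front.tag == "front" and any(c.tag == "title" for c in front)
    ]


front = ET.fromstring(
    "<front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front>"
)
context = [front]
print([n.tag for n in select_front_title(context)])
print([n.tag for n in select_front_deep_title(context)])
```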

5.2. Return Operators (“?”, “??”)

XQL has two kinds of return operators. The shallow return operator (“?”) returns just the node to which it is applied. For instance, if it is applied to an element, it does not cause attributes or children of that element to be returned. The deep return operator (“??”) returns the element and all its children. Return operators can simplify queries for complex document structures. We will use the following sample data as a basis for our discussion of return operators:

<?xml version="1.0"?> 
<invoicecollection> 
  <invoice> 
    <customer> Wile E. Coyote, Death Valley, CA </customer> 
    <annotation> 
         Customer asked that we guarantee return rights if 
         these items should fail in desert conditions. This 
         was approved by Marty Melliore, general manager. 
    </annotation> 
    <entries n="2"> 
      <entry quantity="2" total_price="134.00"> 
        <product maker="ACME" prod_name="screwdriver" price="80.00"/> 
      </entry> 
      <entry quantity="1" total_price="20.00"> 
        <product maker="ACME" prod_name="power wrench" price="20.00"/> 
      </entry> 
    </entries> 
  </invoice> 
  <invoice> 
    <customer> Camp Mertz </customer> 
    <entries n="2"> 
      <entry quantity="2" total_price="32.00"> 
        <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/> 
      </entry> 
      <entry quantity="1" total_price="13.00"> 
        <product maker="BSA" prod_name="snipe call" price="13.00"/> 
      </entry> 
    </entries> 
  </invoice> 
</invoicecollection>

Suppose you wanted to see all products that occur on an invoice. You could do this with the following query:

invoice//product

Here are the results of the above query for our sample data:

        
<xql:result>
   <product maker="ACME" prod_name="screwdriver" price="80.00"/> 
   <product maker="ACME" prod_name="power wrench" price="20.00"/> 
   <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/> 
   <product maker="BSA" prod_name="snipe call" price="13.00"/>
</xql:result>

Unfortunately, the results do not show which products are found on the same invoice. A shallow return ("?") on the <invoice> element returns the <invoice> element, providing an element within which the products can be listed:

invoice?//product

Here are the results of the above query:

        
<xql:result>
   <invoice>
     <product maker="ACME" prod_name="screwdriver" price="80.00"/> 
     <product maker="ACME" prod_name="power wrench" price="20.00"/> 
   </invoice>
   <invoice>
    <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/> 
    <product maker="BSA" prod_name="snipe call" price="13.00"/>
   </invoice>
</xql:result>

Suppose we wanted to see the customer for each invoice, together with the product. This can be done by specifying both <product> and <customer> using the deep return operator:

invoice?[customer??]//product??

Here are the results of this query:

        
<xql:result>
   <invoice>
     <customer> Wile E. Coyote, Death Valley, CA </customer> 
     <product maker="ACME" prod_name="screwdriver" price="80.00"/> 
     <product maker="ACME" prod_name="power wrench" price="20.00"/> 
   </invoice>
   <invoice>
    <customer> Camp Mertz </customer> 
    <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/> 
    <product maker="BSA" prod_name="snipe call" price="13.00"/>
   </invoice>
</xql:result>
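The grouping behavior of the two queries above can be approximated with a short Python sketch. It is an illustration under stated assumptions rather than an XQL engine: the function name group_by_invoice and its with_customer flag are hypothetical, the sample data is abbreviated (the annotation is omitted), and attribute values are quoted so that a standard XML parser accepts the input. The shallow return is modeled as a bare copy of the <invoice> node, and the deep return as a copy of the node with everything beneath it.

```python
import copy
import xml.etree.ElementTree as ET

SAMPLE = """<invoicecollection>
  <invoice>
    <customer> Wile E. Coyote, Death Valley, CA </customer>
    <entries n="2">
      <entry quantity="2" total_price="134.00">
        <product maker="ACME" prod_name="screwdriver" price="80.00"/>
      </entry>
      <entry quantity="1" total_price="20.00">
        <product maker="ACME" prod_name="power wrench" price="20.00"/>
      </entry>
    </entries>
  </invoice>
  <invoice>
    <customer> Camp Mertz </customer>
    <entries n="2">
      <entry quantity="2" total_price="32.00">
        <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/>
      </entry>
      <entry quantity="1" total_price="13.00">
        <product maker="BSA" prod_name="snipe call" price="13.00"/>
      </entry>
    </entries>
  </invoice>
</invoicecollection>"""


def group_by_invoice(root, with_customer=False):
    """Approximate `invoice?//product`; with with_customer=True,
    approximate `invoice?[customer??]//product??`."""
    result = ET.Element("xql:result")
    for invoice in root.findall("invoice"):
        customers = invoice.findall("customer")
        products = list(invoice.iter("product"))
        if not products or (with_customer and not customers):
            continue  # the path or the filter does not match this invoice
        # Shallow return: only the <invoice> node, no attributes or children.
        shell = ET.SubElement(result, invoice.tag)
        if with_customer:
            # Deep return: each <customer> with everything beneath it.
            for customer in customers:
                shell.append(copy.deepcopy(customer))
        # Products are copied whole, wherever they occur beneath the invoice.
        for product in products:
            shell.append(copy.deepcopy(product))
    return result
```

Calling group_by_invoice(ET.fromstring(SAMPLE)) yields two bare <invoice> elements holding two products each; passing with_customer=True places the full <customer> element ahead of the products in each group.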

Conditions may be added to various elements in such a query, and the results that are returned may be on a different branch from those used as the basis for conditions. For instance, the following query returns customers who ordered left-handed smoke shifters:

invoice[customer??]//entry/product[@prod_name="left-handed smoke shifter"]

The following shows invoices for which Camp Mertz ordered a left-handed smoke shifter:

invoice??[customer="Camp Mertz"]//entry/product[@prod_name="left-handed smoke shifter"]

We can take a fairly complex query, observe that the correct results are returned, then move the return operators around to obtain different results using the same conditions. For instance, we can take the above query, and return just the customer and the product, grouped by invoice:

invoice?[customer??="Camp Mertz"]//entry/product??[@prod_name="left-handed smoke shifter"]

The only difference between these queries is the placement of the return operators. This makes it easy to reuse the same conditions when constructing new queries.
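The separation between conditions and returned nodes can be sketched in Python as follows. This is a rough approximation, not the XQL evaluation model: matching_invoices and customers_who_ordered are hypothetical names, the sample data is abbreviated with attribute values quoted, and comparing customer names after stripping surrounding whitespace is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<invoicecollection>
  <invoice>
    <customer> Wile E. Coyote, Death Valley, CA </customer>
    <entries n="2">
      <entry quantity="2" total_price="134.00">
        <product maker="ACME" prod_name="screwdriver" price="80.00"/>
      </entry>
    </entries>
  </invoice>
  <invoice>
    <customer> Camp Mertz </customer>
    <entries n="2">
      <entry quantity="2" total_price="32.00">
        <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/>
      </entry>
      <entry quantity="1" total_price="13.00">
        <product maker="BSA" prod_name="snipe call" price="13.00"/>
      </entry>
    </entries>
  </invoice>
</invoicecollection>"""


def matching_invoices(root, prod_name, customer=None):
    """Yield invoices that satisfy the conditions, whatever is returned later."""
    for invoice in root.findall("invoice"):
        # Assumption: names are compared after stripping whitespace.
        names = [(c.text or "").strip() for c in invoice.findall("customer")]
        if customer is not None and customer not in names:
            continue
        products = invoice.findall(".//entry/product")
        if any(p.get("prod_name") == prod_name for p in products):
            yield invoice


def customers_who_ordered(root, prod_name):
    """Condition on the product branch, result from the customer branch."""
    return [(c.text or "").strip()
            for invoice in matching_invoices(root, prod_name)
            for c in invoice.findall("customer")]
```

The conditions live in one function and the choice of what to return in another, which mirrors how moving a return operator changes the result without touching the conditions.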

5.3. Sequence

In systems where XML is used mainly to represent data from object oriented systems or relational databases, sequence may not be particularly important. However, sequence is often very important to the meaning of documents. For instance, consider the following table:

Song	Mode
Shady Grove	Aeolian
Over the River, Charlie	Dorian

Someone may want to ask what mode the song “Shady Grove” is in. In HTML, the above table may be represented like this (omitting the headers to keep the example short):

<TABLE>
  <ROWS>
    <TR>
      <TD>Shady Grove</TD>
      <TD>Aeolian</TD>
    </TR>
    <TR>
      <TD>Over the River, Charlie</TD>
      <TD>Dorian</TD>
    </TR>
  </ROWS>
</TABLE>

In this example, the mode for “Shady Grove” is found in the <TD> element that immediately follows the <TD> containing the value “Shady Grove”. The “immediately precedes” operator (“;”) selects adjacent nodes. This query returns both the TD that contains “Shady Grove” and the TD that immediately follows it:

TD= "Shady Grove" ; TD

The above query searches for any sequence of two TD tags in which the first is equal to “Shady Grove”. However, it may be combined with hierarchy conditions for more specific searches; e.g. the following search looks for such a sequence only within a table (i.e. within the subtree found beneath a TABLE element):

TABLE // (TD= "Shady Grove" ; TD)

The previous example discusses the “immediately precedes” relationship, which specifies the relative position of two adjacent nodes. The “precedes” relationship, which specifies that one node occur prior to the other node, but does not specify that they be adjacent, is also important to the structure of documents. Consider the following excerpt from Hamlet:

<SPEECH>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>'Tis gone!</LINE>
  <STAGEDIR>Exit Ghost</STAGEDIR>
  <LINE>We do it wrong, being so majestical,</LINE>
  <LINE>To offer it the show of violence;</LINE>
  <LINE>For it is, as the air, invulnerable,</LINE>
  <LINE>And our vain blows malicious mockery.</LINE>
</SPEECH>

Suppose an actor playing the ghost wants to know when to exit; that is, he wants to know who says what line just before he is supposed to exit. The line immediately precedes the stagedir, but the speaker may occur at any time before the line. In this query, we will use the “precedes” operator (“;;”) to identify a speaker that precedes the line somewhere within a speech. Our ghost can find the required information with the following query, which selects the speaker, the line, and the stagedir:

SPEECH // (SPEAKER ;; LINE ; STAGEDIR= "Exit Ghost")

5.3.1. Immediately precedes (‘;’)

The “immediately precedes” operator evaluates to the set of node pairs in the search context in which the left operand immediately precedes the right operand. Both the left operand and the right operand will be in the evaluation.

Search Context:

<SPEECH>
  <SPEAKER>HORATIO</SPEAKER>
  <LINE>'Tis here!</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>'Tis gone!</LINE>
  <STAGEDIR>Exit Ghost</STAGEDIR>
  <LINE>We do it wrong, being so majestical,</LINE>
  <LINE>To offer it the show of violence;</LINE>
  <LINE>For it is, as the air, invulnerable,</LINE>
  <LINE>And our vain blows malicious mockery.</LINE>
</SPEECH>

The following query asks for pairs of elements in which <SPEAKER> immediately precedes <LINE>:

Query:

SPEECH/(SPEAKER ; LINE)

Result:

<xql:result>
  <SPEAKER>HORATIO</SPEAKER>
  <LINE>'Tis here!</LINE>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>'Tis gone!</LINE>
</xql:result>
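The adjacency test behind this operator can be sketched in Python. This is a minimal illustration that assumes both operands are plain element names and that adjacency is judged among the element children of a single parent; the function name immediately_precedes and the enclosing <SCENE> wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The <SCENE> wrapper is hypothetical; it only gives the speeches one root.
SCENE = """<SCENE>
  <SPEECH>
    <SPEAKER>HORATIO</SPEAKER>
    <LINE>'Tis here!</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>MARCELLUS</SPEAKER>
    <LINE>'Tis gone!</LINE>
    <STAGEDIR>Exit Ghost</STAGEDIR>
    <LINE>We do it wrong, being so majestical,</LINE>
  </SPEECH>
</SCENE>"""


def immediately_precedes(parent, left_tag, right_tag):
    """Return (left, right) pairs of adjacent children with the given tags."""
    children = list(parent)
    pairs = []
    # Walk each child together with its immediate successor.
    for left, right in zip(children, children[1:]):
        if left.tag == left_tag and right.tag == right_tag:
            pairs.append((left, right))
    return pairs
```

Applying immediately_precedes(speech, "SPEAKER", "LINE") to each <SPEECH> in turn corresponds roughly to the query SPEECH/(SPEAKER ; LINE); both members of every pair are kept, as in the result above.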

5.3.2. Precedes (‘;;’)

The “precedes” operator evaluates its left-hand operand for the search context, then, for each node in the left-hand evaluation, evaluates to that node and all nodes matching the right-hand operand that follow it.

Search Context:

<SPEECH>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>'Tis gone!</LINE>
  <STAGEDIR>Exit Ghost</STAGEDIR>
  <LINE>We do it wrong, being so majestical,</LINE>
  <LINE>To offer it the show of violence;</LINE>
  <LINE>For it is, as the air, invulnerable,</LINE>
  <LINE>And our vain blows malicious mockery.</LINE>
</SPEECH>

The following query asks for sets of sequences of elements in which a <STAGEDIR> element precedes <LINE> elements:

Query:

SPEECH/(STAGEDIR ;; LINE)

Result:

<xql:result>
  <STAGEDIR>Exit Ghost</STAGEDIR>
  <LINE>We do it wrong, being so majestical,</LINE>
  <LINE>To offer it the show of violence;</LINE>
  <LINE>For it is, as the air, invulnerable,</LINE>
  <LINE>And our vain blows malicious mockery.</LINE>
</xql:result>
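A Python sketch of this behavior follows. It rests on two assumptions that go beyond the text: operands are plain element names, and the nodes kept after each left-hand node are restricted to later siblings that match the right-hand operand. The function name precedes is hypothetical.

```python
import xml.etree.ElementTree as ET

SPEECH = """<SPEECH>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>'Tis gone!</LINE>
  <STAGEDIR>Exit Ghost</STAGEDIR>
  <LINE>We do it wrong, being so majestical,</LINE>
  <LINE>To offer it the show of violence;</LINE>
  <LINE>For it is, as the air, invulnerable,</LINE>
  <LINE>And our vain blows malicious mockery.</LINE>
</SPEECH>"""


def precedes(parent, left_tag, right_tag):
    """For each left_tag child, collect it with the right_tag siblings after it."""
    children = list(parent)
    sequences = []
    for index, left in enumerate(children):
        if left.tag != left_tag:
            continue
        # Assumption: only later siblings matching the right operand are kept.
        followers = [node for node in children[index + 1:]
                     if node.tag == right_tag]
        if followers:
            sequences.append([left] + followers)
    return sequences
```

For the search context above, precedes(ET.fromstring(SPEECH), "STAGEDIR", "LINE") produces a single sequence: the stage direction followed by the four lines after it, in document order.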

6. References

6.1. Bibliography

The Object Database Standard: ODMG 2.0 (Morgan Kaufmann Series in Data Management Systems) by R. G. G. Cattell (Editor), Douglas K. Barry (Editor), Dirk Bartels (Editor). July 1997. The OQL query language was a major influence on the design of XQL, though the syntaxes are quite different.
“XML, Java, and the future of the Web”, Jon Bosak, Sun Microsystems. (http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html).
“XML Linking Language (XLink)”, World Wide Web Consortium Working Draft 3-March-1998, Eve Maler (Arbortext) and Steve DeRose (INSO Corp and Brown University), editors. (http://www.w3.org/TR/1998/WD-xlink-19980303).
“XML Pointer Language (XPointer)”, World Wide Web Consortium Working Draft 3-March-1998, Eve Maler (Arbortext) and Steve DeRose (INSO Corp and Brown University), editors. (http://www.w3.org/TR/1998/WD-xptr-19980303).
“Namespaces in XML”, World Wide Web Consortium Working Draft 2-August-1998, Tim Bray (Textuality), Dave Hollander (Hewlett-Packard Company), Andrew Layman (Microsoft), editors. (http://www.w3.org/TR/WD-xml-names).
“Document Object Model” (http://www.w3.org/TR/WD-DOM/).
“Uniform Resource Locators”, Internet Engineering Task Force RFC 1738. (http://www.w3.org/Addressing/rfc1738.txt).
“Relative Uniform Resource Locators”, Internet Engineering Task Force RFC 1808. (http://www.w3.org/Addressing/rfc1808.txt).
“Guidelines for Electronic Text Encoding and Interchange.” C. M. Sperberg-McQueen and Lou Burnard, editors. (http://www.uic.edu/orgs/tei/, http://etext.virginia.edu/TEI.html).
“Hypermedia/Time-based Structuring Language (HyTime).” ISO/IEC 10744-1992 (E). (http://www.ornl.gov/sgml/wg8/docs/n1920/html/n1920.html).

6.2. Sample XML and SGML Documents

Another important influence on XQL was a testbed of documents that included a number of proprietary documents, together with these publicly available documents. Exploring the structure of real documents was an important part of designing the query language.

Joseph Conrad’s “Heart of Darkness”, tagged by David Megginson, available at: “http://home.sprynet.com/sprynet/dmeggins/texts/darkness/index.html”.
John Spinoza’s Surgical Pathology reports, developed as sample data for the HL7 Kona Proposal, available at: “http://www.mcis.duke.edu:80/standards/HL7/sigs/sgml/WhitePapers/KONA/Apps/”.
The XML 1.0 Recommendation, available at: “http://www.w3.org/TR/1998/REC-xml-19980210.xml”.
The Works of Shakespeare, tagged by Jon Bosak, available at: “http://sunsite.unc.edu/pub/sun-info/xml/eg/shakespeare.1.10.xml.zip”.
The “Religion 101” archives, tagged by Jon Bosak, containing the Old Testament, New Testament, Koran, and the Book of Mormon, available at: “http://sunsite.unc.edu/pub/sun-info/xml/eg/religion.1.10.xml.zip”.
“Notes on DTD Recombination”, a paper on combining documents of different DTDs using namespaces, including useful sample documents, by C. M. Sperberg-McQueen, available at: “http://www.uic.edu/~cmsmcq/tech/xml/munging.html”.
