Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:57+0000

Getting started with XQuery

About XQuery and the eXist XML database

For XQuery in this course we’ll be using the eXist XML database, which we’ve installed on Obdurodon at http://obdurodon.org:8080. If you wind up using XQuery in your own projects, we’ll create accounts for you on eXist (for homework purposes in this course you won’t need separate accounts). An XML database like eXist works by storing the raw XML files along with persistent indexes.

An index is a way of finding the location of a piece of information quickly; in an index at the back of a book, you find what you’re looking for and it tells you the appropriate page number, so that you don’t have to look at every page of the book each time you want to retrieve a piece of information. This makes it possible to search for particular information (element and attribute values, etc.) quickly. What’s persistent about the index is that it’s created and stored in the database along with the data, so it’s always available. Without a persistent index, when you evaluate an XPath path expression, the system has to read the XML file, parse it (analyze its structure), and build a tree-like representation of the structure in memory, which it then searches. An XML database like eXist does the parsing and indexing once and stores the information in a way that provides quick retrieval. All of this is transparent to the user; the system takes care of the indexing quietly, behind the scenes.

The main way we’ll be interacting with eXist will be through eXide, the XQuery integrated development environment, available by clicking on the eXide button on the main eXist page on Obdurodon.

The organization of the eXist database

The eXist database is organized like a hierarchical file system: a single root directory contains subdirectories and files, those subdirectories in turn contain subdirectories and files, and so on. The root directory is called db. (In technical XQuery terminology, these directories are called collections, but we’ll use the terms directory and subdirectory as synonyms of collection.) For example, the XML version of Hamlet has been installed in the database in the data subdirectory of the shakespeare subdirectory of the apps subdirectory of db, so its address is /db/apps/shakespeare/data/ham.xml. If you want to use your own XML files in eXist, you have to install (upload) the files into eXist, and we can show you how to do that. This is different from just putting them on the web server in your regular project space, which makes them accessible but doesn’t build index files or let you explore them using XQuery expressions.
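
If you want to see this hierarchy for yourself, a query along the following lines lists what a collection contains. This is only a sketch: it assumes that the eXist-specific xmldb module is available on the server and that your account is allowed to read the collections. These functions belong to eXist, not to standard XQuery.

```xquery
(: eXist-specific module; not part of standard XQuery :)
import module namespace xmldb="http://exist-db.org/xquery/xmldb";

(: names of the subcollections (subdirectories) directly under /db/apps/shakespeare :)
xmldb:get-child-collections('/db/apps/shakespeare'),
(: names of the files stored directly in the data subcollection :)
xmldb:get-child-resources('/db/apps/shakespeare/data')
```

If the corpus is installed as described above, the first list should include data and the second should include ham.xml.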

XQuery and XPath

XQuery uses XPath 3.0 to navigate documents, which it finds in one of the following two ways:

It uses the doc() function to find a single document, where the argument to the function is the path to the document, including the filename. For example, to retrieve Hamlet you would use doc('/db/apps/shakespeare/data/ham.xml').
The collection() function finds a directory that itself may contain XML files. I’ve installed forty-two Shakespearean texts into the database, and to run a query over the entire Shakespearean corpus on the server, instead of addressing a single document with doc(), you address the directory that contains the multiple individual files with collection(). You could address the entire Shakespearean corpus at once, then, with collection('/db/apps/shakespeare/data').

Note that you have to put quotation marks (single or double) around the path to the file or collection you want to use.
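
A quick way to check that you have the paths right is to ask whether a document exists and how many documents a collection holds. The sketch below uses only standard functions and the paths given above; it assumes the corpus is still installed at those locations.

```xquery
(: true if a document can be found at this path, false otherwise :)
doc-available('/db/apps/shakespeare/data/ham.xml'),
(: the number of documents found in the collection :)
count(collection('/db/apps/shakespeare/data'))
```

If the second number is not the forty-two you expect, the collection path is probably mistyped or the contents of the collection have changed.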

doc() and collection() are XPath functions (you can use them in XSLT, too), and once you’ve found the document or collection you want to query, you can use regular XPath path expressions, predicates, functions, axes, etc. For example, to find all of the <speaker> elements that are longer than ten characters in all of the plays, you could type:

declare namespace tei="http://www.tei-c.org/ns/1.0";
collection('/db/apps/shakespeare/data')//tei:speaker[string-length() gt 10]

The Shakespeare corpus in eXist is in the TEI namespace, so you need to declare the namespace and bind it to the prefix tei. You then need to use the prefix whenever you refer to an element in the document. In XSLT you used the @xpath-default-namespace attribute to specify that your input document was in a particular namespace, which set it up as a default and saved you from having to use the namespace prefix each time you referred to an element. That attribute isn’t available in XQuery, and while there is an alternative way to specify a default namespace, it has unwanted side effects, and as a result we don’t recommend using it (and we don’t use it ourselves in our own work).
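
To see why the prefix matters, you can compare a prefixed and an unprefixed version of the same path. This is a small illustration rather than something you would normally write; it assumes, as stated above, that every <speaker> element in Hamlet is in the TEI namespace.

```xquery
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: prefixed name: counts <speaker> elements in the TEI namespace :)
count(doc('/db/apps/shakespeare/data/ham.xml')//tei:speaker),
(: unprefixed name: looks for <speaker> in no namespace, so it should find none here :)
count(doc('/db/apps/shakespeare/data/ham.xml')//speaker)
```

A path that unexpectedly returns nothing is often a sign that a namespace prefix is missing.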

The first step in the path is to find the entire collection of plays. The // following the collection function has its usual XPath meaning: look as deeply inside the collection as you need to look. What it’s looking for is all <speaker> elements in the TEI namespace, which it then filters in a way that should be very familiar to you by now. So far in XSLT you’ve used XPath with just a single document, but when you use the collection() function, you can think of the entire collection as being one level up from each document. That is, in XPath terms, each of the 42 plays is a child of the entire collection.
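
One way to convince yourself that the collection sits one level above the individual plays is to take a single step down from it. The sketch below assumes nothing about the markup of the plays beyond the fact that each is a well-formed XML document.

```xquery
(: one document node per play; report where each one lives in the database :)
collection('/db/apps/shakespeare/data')/base-uri(),
(: step down from the collection to the root element of each document and report its name :)
collection('/db/apps/shakespeare/data')/*/name()
```

The first expression should return one path per play, and the second should return the name of each play’s root element.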

How we learn XQuery in this course

Our first activities with XQuery will involve writing individual XPath expressions to retrieve information from the database. Once we’ve gained some familiarity with that, we’ll move on to building more complex queries, such as those that can retrieve multiple types of information and assemble them into a fully-formed web page before returning them to the user.
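
As a preview of where this is heading, here is a rough sketch of a query that wraps results in a simple HTML page. It reuses the speaker example from earlier; the page title and the choice of a bulleted list are our own assumptions for illustration, and the for and return keywords are part of the FLWOR syntax you’ll learn later, so don’t worry if they look unfamiliar.

```xquery
declare namespace tei="http://www.tei-c.org/ns/1.0";
<html>
    <head>
        <title>Long speaker names in Shakespeare</title>
    </head>
    <body>
        <ul>{
            (: one list item per distinct speaker name longer than ten characters :)
            for $name in distinct-values(
                collection('/db/apps/shakespeare/data')//tei:speaker[string-length() gt 10]
            )
            return <li>{$name}</li>
        }</ul>
    </body>
</html>
```

The curly braces mark the places where XQuery evaluates an expression and inserts the result into the surrounding HTML.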

What you can do first

As you’re learning XQuery, use eXide to enter queries and run them. The simplest XQueries are just XPath expressions like the example above that finds speakers in Shakespearean plays. You can retrieve an entire play with:

doc('/db/apps/shakespeare/data/ham.xml')

This finds and returns the document node of the XML file. In most cases you may not want to retrieve an entire XML file in raw form, but it can be useful. At http://menology.obdurodon.org/, the only functionality at the moment (the site is in an early stage of development) is to view an entire document from the database either transformed a certain way with XSLT or as raw XML. The XML files live in eXist, and when we return the raw file, we just pull it out of the database with this type of simple XPath expression. When we transform it, we let eXist return the information we need and run it through an XSLT transformation.
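
A transformation of that kind might look roughly like the sketch below. It is not necessarily how the menology site does it: it assumes the eXist-specific transform module, and the stylesheet path is hypothetical, so you would need to replace it with the location of a real XSLT file stored in the database.

```xquery
(: eXist-specific module; not part of standard XQuery :)
import module namespace transform="http://exist-db.org/xquery/transform";

(: the raw XML, pulled straight from the database :)
let $play := doc('/db/apps/shakespeare/data/ham.xml')
(: hypothetical stylesheet path; replace with a real XSLT file stored in eXist :)
let $stylesheet := doc('/db/apps/shakespeare/to-html.xsl')
(: the third argument holds stylesheet parameters; the empty sequence means none :)
return transform:transform($play, $stylesheet, ())
```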

What’s next

You’ll be reading a tutorial about FLWOR expressions in the next few days. Stay tuned!