Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:57+0000
For XQuery in this course we’ll be using the eXist XQuery database, which we’ve installed on Obdurodon at http://obdurodon.org:8080. If you wind up using XQuery in your own projects, we’ll create accounts for you on eXist (for homework purposes in this course you won’t need separate accounts). An XML database like eXist works by storing the raw XML files along with persistent indexes.
An index is a way of finding the location of a piece of information quickly; in an index at the back of a book, you find what you’re looking for and it tells you the appropriate page number, so that you don’t have to look at every page of the book each time you want to retrieve a piece of information. This makes it possible to search for particular information (element and attribute values, etc.) quickly. What’s persistent about the index is that it’s created and stored in the database along with the data, so it’s always available. Without a persistent index, when you evaluate an XPath path expression, the system has to read the XML file, parse it (analyze its structure), and build a tree-like representation of the structure in memory, which it then searches. An XML database like eXist does the parsing and indexing once and stores the information in a way that provides quick retrieval. All of this is transparent to the user; the system takes care of the indexing quietly, behind the scenes.
The main way we’ll be interacting with eXist will be through eXide, the XQuery integrated development environment, available by clicking on the eXide button on the main eXist page on Obdurodon.
The eXist database is organized like a hierarchical file system, which means that there is a single root directory that contains subdirectories and files; the subdirectories, in turn, contain subdirectories and files, and so on. The root directory is called db. (In technical XQuery terminology, these directories are called collections, but we’ll use the terms directory and subdirectory as synonyms of collection.) For example, the XML version of Hamlet has been installed in the database in the data subdirectory of the shakespeare subdirectory of the apps subdirectory of db, so its address is /db/apps/shakespeare/data/ham.xml. If you want to use your own XML files in eXist, you have to install (upload) the files into eXist, and we can show you how to do that. This is different from just putting them on the web server in your regular project space, which makes them accessible, but which doesn’t build index files or let you explore them using XQuery expressions.
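To make the directory picture concrete, here is a minimal sketch that lists what sits under the Shakespeare directories. It assumes eXist's own xmldb extension module (these functions are not part of standard XQuery) and assumes that the account running the query has permission to read those collections.

```xquery
(: eXist-specific module for inspecting collections; not standard XQuery :)
import module namespace xmldb="http://exist-db.org/xquery/xmldb";

(: One sequence with two groups of names:
   first the subdirectories of /db/apps/shakespeare,
   then the files stored in its data subdirectory :)
(
    xmldb:get-child-collections('/db/apps/shakespeare'),
    xmldb:get-child-resources('/db/apps/shakespeare/data')
)
```

If the layout is as described above, data should appear among the subdirectory names and ham.xml among the file names.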
XQuery uses XPath 3.0 to navigate documents, which it finds in one of the following two ways:
- Use the doc() function to find a single document, where the argument to the function is the path to the document, including the filename. For example, to retrieve Hamlet you would use doc('/db/apps/shakespeare/data/ham.xml').
- The collection() function finds a directory that itself may contain XML files. I’ve installed forty-two Shakespearean texts into the database, and to run a query over the entire Shakespearean corpus on the server, instead of addressing a single document with doc(), you address the directory that contains the multiple individual files with collection(). You could address the entire Shakespearean corpus at once, then, with collection('/db/apps/shakespeare/data').
Note that you have to put quotation marks (single or double) around the path to the file or collection you want to use.
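As a quick check of the two functions, the following sketch asks one question of a single document and one of the whole collection. It uses only standard XPath functions and the two paths given above. The parentheses and comma simply return both answers together as one sequence.

```xquery
(
    (: doc(): the name of the root element of Hamlet :)
    doc('/db/apps/shakespeare/data/ham.xml')/*/name(),
    (: collection(): how many documents the directory contains :)
    count(collection('/db/apps/shakespeare/data'))
)
```

If the data directory holds only the forty-two texts mentioned above, the second value should be 42.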
doc() and collection() are XPath functions (you can use them in XSLT, too), and once you’ve found the document or collection you want to query, you can use regular XPath path expressions, predicates, functions, axes, etc. For example, to find all of the <speaker> elements whose content is longer than ten characters in all of the plays, you could type:
declare namespace tei="http://www.tei-c.org/ns/1.0";
collection('/db/apps/shakespeare/data')//tei:speaker[string-length() gt 10]
The Shakespeare corpus in eXist is in the TEI namespace, so you need to declare the namespace and bind it to the prefix tei. You then need to use the prefix whenever you refer to an element in the document. In XSLT you used the @xpath-default-namespace attribute to specify that your input document was in a particular namespace, which set it up as a default and saved you from having to use the namespace prefix each time you referred to an element. That attribute isn’t available in XQuery, and while there is an alternative way to specify a default namespace, it has unwanted side effects, and as a result we don’t recommend using it (and we don’t use it ourselves in our own work).
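To see why the prefix matters, this sketch counts speakers three ways. It assumes the query declares no default element namespace, so an unprefixed name refers to an element in no namespace.

```xquery
declare namespace tei="http://www.tei-c.org/ns/1.0";

(
    (: prefixed name: matches <speaker> only in the TEI namespace :)
    count(collection('/db/apps/shakespeare/data')//tei:speaker),
    (: no prefix: looks for <speaker> in no namespace :)
    count(collection('/db/apps/shakespeare/data')//speaker),
    (: wildcard prefix: matches <speaker> in any namespace :)
    count(collection('/db/apps/shakespeare/data')//*:speaker)
)
```

In a corpus that is entirely in the TEI namespace, the first and third counts should agree and the second should be zero. A zero like this is the usual symptom of a forgotten prefix.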
The first step in the path is to find the entire collection of plays. The // following the collection() function has its usual XPath meaning: look as deeply inside the collection as you need to look. What it’s looking for is all <speaker> elements in the TEI namespace, which it then filters in a way that should be very familiar to you by now. So far in XSLT you’ve used XPath with just a single document, but when you use the collection() function, you can think of the entire collection as being one level up from each document. That is, in XPath terms, each of the 42 plays is a child of the entire collection.
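One way to see the collection as a level above the individual plays is to take a single step from it and evaluate an expression once per document. This is a minimal sketch that uses a function call as the final step of the path, which XPath 2.0 and later allow.

```xquery
declare namespace tei="http://www.tei-c.org/ns/1.0";

(: For each document in the collection, count its <speaker> elements.
   The result is a sequence of integers, one per document. :)
collection('/db/apps/shakespeare/data')/count(.//tei:speaker)
```

Compare this with wrapping the whole path in count(), which returns a single total for the entire corpus.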
Our first activities with XQuery will involve writing individual XPath expressions to retrieve information from the database. Once we’ve gained some familiarity with that, we’ll move on to building more complex queries, such as those that can retrieve multiple types of information and assemble them into a fully-formed web page before returning them to the user.
As you’re learning XQuery, use eXide to enter queries and run them. The simplest XQueries are just XPath expressions like the example above that finds speakers in Shakespearean plays. You can retrieve an entire play with:
doc('/db/apps/shakespeare/data/ham.xml')
This finds and returns the document node of the XML file. In most cases you may not want to retrieve an entire XML file in raw form, but it can be useful. At http://menology.obdurodon.org/, the only functionality at the moment (the site is in an early stage of development) is to view an entire document from the database either transformed a certain way with XSLT or as raw XML. The XML files live in eXist, and when we return the raw file, we just pull it out of the database with this type of simple XPath expression. When we transform it, we let eXist return the information we need and run it through an XSLT transformation.
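The retrieve-then-transform pattern described above might be sketched as follows. This is an illustration under stated assumptions, not the menology site's actual code: it relies on eXist's transform extension module, and the stylesheet path is hypothetical.

```xquery
(: eXist-specific module for running XSLT from inside XQuery :)
import module namespace transform="http://exist-db.org/xquery/transform";

transform:transform(
    (: the document to transform, pulled from the database :)
    doc('/db/apps/shakespeare/data/ham.xml'),
    (: HYPOTHETICAL stylesheet location, used only for illustration :)
    doc('/db/apps/shakespeare/xslt/play-to-html.xsl'),
    (: optional stylesheet parameters; the empty sequence means none :)
    ()
)
```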
You’ll be reading a tutorial about FLWOR expressions in the next few days. Stay tuned!