Gymnosperms of the Southeastern US

A Premature Sample of the Use of XML in Systematic Botany

Ron Gilmour, UNC Herbarium Staff

Introduction

XML (eXtensible Markup Language) is a markup language analogous to the more familiar HTML, but it allows the person doing the markup to create tags rather than just apply a standard set of tags to the document at hand. It is a relatively new language and seeks to combine the flexibility of SGML (Standard Generalized Markup Language) with the simplicity of HTML. (There are many resources available on the web with abundant general information about XML, and I will not try to reproduce their strengths here. Two of my favorites are the sites from W3C and OASIS.)

XML differs from HTML in that the emphasis is on the structure of information rather than its presentation. While HTML tags provide a mixture of semantic/structural information (i.e. information about the information itself, for instance the <H*> and <TITLE> tags) and presentation information (i.e. information about how the writer of the document wishes the information to be displayed, for instance the <I> and <font> tags), XML seeks to contain only semantic/structural information in the document tags and deals with issues of presentation by linking to a stylesheet.
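As a rough illustration of the difference, here is how a single species name might be marked up semantically. The element names are hypothetical, chosen only for this sketch, and are not taken from the gymnosperm files.

```xml
<?xml version="1.0"?>
<!-- Hypothetical element names, for illustration only. -->
<!-- The HTML version, <I>Pinus palustris</I> Mill., says only "italicize this". -->
<!-- The XML version records what each piece of text is and leaves -->
<!-- the italics to a stylesheet. -->
<species>
  <name>Pinus palustris</name>
  <authority>Mill.</authority>
  <common_name>longleaf pine</common_name>
</species>
```

Because the authority sits in its own element, a program can find it without guessing at the layout of the text.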

What this means (or will mean) to the person accessing XML files on the web is that the pages can contain detailed "indexing" of their own content in the form of highly specific tags. Let’s say that rather than just the gymnosperms, I had marked up the entire Vascular Flora of the Southeastern United States in XML. You might decide that you are only interested in seeing a list of taxa which are also found in Ohio, or a list of only those names with "Ell." as the authority. In short, the ultimate goal of XML is the presentation of information on the web in a way that will allow users to interact with the information, making a webpage behave in an almost database-like manner. This, of course, is dependent not only on the coding of documents in XML, but also on the ability of the browser to present the user with options.

Now the bad news: there is no XML browser. Hence the "premature" in the title of this page. Microsoft Internet Explorer 5 will display XML documents, and will even let you play with them a bit by collapsing and expanding the hierarchically arranged sections of the document, but that’s all you can do. For now. For the present, I’ve constructed a few workarounds that will, I hope, give viewers a feel for some of what XML will do in the future. So, download Internet Explorer 5 if you don’t already have it, and enjoy!

XML and Biological Taxonomy

XML was created for the display of complex, structured information, so its use for presenting hierarchical classifications seems obvious. One of my goals in making this page available at such an early stage in the implementation of XML is to start people in the systematic community thinking about how XML can best be used in this context. Ultimately, I would like to see an XML Document Type Definition proposed for consistent use within systematic biology.

A Document Type Definition (DTD) is essentially a catalog of elements, with instructions for what the elements may contain and how they are related. (Elements are units of marked up content, as distinct from tags, which are just the markup.) In the gymnosperm document, for instance, I have defined the genus element such that it must contain at least one species element. The species element may not contain a genus element. Likewise for morphological information: the flower element may contain a perianth element, but not a fruit element. [Note: HTML also has a document type definition, but since everyone uses the same DTD, no one has to think about it. The HTML DTD is what tells you that a <TD> tag can only occur inside <TR> tags, which can only occur inside <TABLE> tags, which can only occur inside <BODY> tags, etc.]
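The constraints just described could be written roughly as follows. This is only a sketch: the declarations are assumptions made for illustration and are not the contents of taxon.dtd. The DTD is placed inside the document here (an "internal subset") so that the example is self-contained.

```xml
<?xml version="1.0"?>
<!-- A sketch only: these declarations are assumptions, not the real taxon.dtd. -->
<!DOCTYPE genus [
  <!-- A genus has a name and must contain at least one species (the "+"). -->
  <!ELEMENT genus     (name, species+)>
  <!-- genus is not listed in the species content model, so a species -->
  <!-- may not contain a genus. -->
  <!ELEMENT species   (name, authority, flower?)>
  <!-- A flower may contain a perianth (the "?" makes it optional). -->
  <!-- fruit is not listed, so it is not allowed here. -->
  <!ELEMENT flower    (perianth?)>
  <!ELEMENT name      (#PCDATA)>
  <!ELEMENT authority (#PCDATA)>
  <!ELEMENT perianth  (#PCDATA)>
]>
<genus>
  <name>Pinus</name>
  <species>
    <name>Pinus taeda</name>
    <authority>L.</authority>
  </species>
</genus>
```

A validating parser should reject this document if the species element were removed, or if a genus element were placed inside species.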

The DTD used in my gymnosperm example is very simplistic and is my first attempt at writing a DTD. I certainly do not propose this as a definitive DTD for systematic biology, but am merely providing it as food for thought.

XML As It Might Be

While we’re waiting for someone to come up with a browser which will allow the user to interact with XML documents, we can fake it using stylesheets. Below is a list of options for ways in which you might want to view the gymnosperm data in Internet Explorer 5. To the right of each item is a link which will show the stylesheet used for that display. Once you get a feel for how these work, try downloading hardin1.xml to a disk or your hard drive. (You’ll also need hardinxml.xml and taxon.dtd, which you can download by viewing the index of /herbarium/xml, right-clicking on the files, and choosing "Save Target As.") Then use a text editor to write your own stylesheet. Save the stylesheet as something like mystyle.css in the same directory as hardin1.xml. Then edit line 2 in hardin1.xml so that it refers to your new stylesheet:

<?xml-stylesheet href="mystyle.css" type="text/css" ?>

This will allow you to view hardin1.xml in whatever way you prefer. Note: The gymnosperm data as a whole is stored in a file called hardinxml.xml. This is referred to as an external entity in the code for each of the following examples, which is why if you "view source" you’ll only see three or four lines of code. The entity is declared in taxon.dtd.
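A short wrapper file of this kind might look something like the sketch below. The root element name "flora" and the entity name "gymnosperms" are hypothetical; the actual names used in hardin1.xml and taxon.dtd may differ.

```xml
<?xml version="1.0"?>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<!-- Sketch of a wrapper file; "flora" and "gymnosperms" are hypothetical names. -->
<!-- Assumes taxon.dtd contains a declaration along the lines of:
     <!ENTITY gymnosperms SYSTEM "hardinxml.xml">
     which tells the parser where to find the data. -->
<!DOCTYPE flora SYSTEM "taxon.dtd">
<flora>&gymnosperms;</flora>
```

When the parser meets the entity reference, it substitutes the contents of the external file, so the same data can be shared by several wrapper files that differ only in the stylesheet named on line 2.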

View file without a stylesheet. This shows the collapsible hierarchy format which IE5 uses as a default when no stylesheet is designated.
View file with a simple stylesheet so that it appears as normal text as it would in a book. View the stylesheet.
View file with all the seed cone characters highlighted. View the stylesheet.
View taxa names only. View the stylesheet.

For the next two examples, I have used a simplified XML document called pinus.xml. This consists of only the various Pinus species and does not include morphological information. The result is a dataset which is "symmetrical," meaning that each "record" has roughly the same number and types of attributes. This is an important requirement for IE5’s "databinding" features, which use either a table or form-like controls in an HTML page to draw upon the data in an XML document. Incidentally, you will notice that pinus.xml is not valid according to my DTD. Furthermore, any attempt to include a reference to a DTD within pinus.xml renders the document unavailable as a data source object, an IE5 quirk for which I have no explanation.

View data as a table within an HTML page. Use "view source" in your browser to see the HTML.
View data as an interactive form. Use "view source" in your browser to see the HTML.

In addition to CSS stylesheets, a specialized style language called Extensible Stylesheet Language (XSL) is being developed by W3C. There is some controversy surrounding XSL, which, compared to CSS, is rather difficult, and some writers have suggested that XSL is unnecessary. (See "XSL Considered Harmful" by Michael Leventhal.) Nevertheless, it does some things of which CSS alone is not capable. Above, we had to "dumb down" the gymnosperm data in order to use IE5’s databinding features to display the data as a table. In the examples below, the original file was used. Note that the table example above is an HTML page which draws data from an XML file. The following table is an XML page which has been transformed into HTML by an XSL stylesheet. Note also that an XSL stylesheet is a well-formed XML document (as demonstrated by the fact that when you view it using the links below, it shows up in the IE5 XML default format).

View full gymnosperm document as a table via an XSL stylesheet. The data in this table is sorted alphabetically by name using the order-by attribute of the <xsl:for-each> element. View the stylesheet.
View list of generic names only, sorted alphabetically. Note that with XSL, the default display is none, so if we just want one piece of information, we can "ask" for it with a very simple stylesheet. View the stylesheet.
View lists of genera and species, color-coded by authority. This example demonstrates a couple of new tricks. First, notice that the XML is transformed into HTML which includes inline CSS styles. This demonstrates how the two languages may be used together. The color-coding by authority is done with the <xsl:choose>, <xsl:when>, and <xsl:otherwise> tags, which are used for providing options based on a criterion. View the stylesheet.
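
For readers who want to experiment, the sorting and color-coding ideas in the list above can be sketched in one small stylesheet. Two assumptions apply. First, the element names (genus, name, authority) are hypothetical and may not match the gymnosperm files. Second, the sketch is written in the syntax of the W3C XSLT 1.0 Recommendation, in which sorting is done with an <xsl:sort> element; the XSL dialect built into IE5 follows an earlier draft and uses the order-by attribute instead, so this stylesheet is not the one linked above and may need adapting for IE5.

```xml
<?xml version="1.0"?>
<!-- XSLT 1.0 sketch. Element names (genus, name, authority) are hypothetical. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <ul>
          <!-- Visit every genus in the document, sorted alphabetically by name. -->
          <xsl:for-each select="//genus">
            <xsl:sort select="name"/>
            <li>
              <!-- Choose an inline CSS color according to the authority. -->
              <xsl:choose>
                <xsl:when test="authority = 'L.'">
                  <span style="color: green"><xsl:value-of select="name"/></span>
                </xsl:when>
                <xsl:when test="authority = 'Ell.'">
                  <span style="color: blue"><xsl:value-of select="name"/></span>
                </xsl:when>
                <xsl:otherwise>
                  <span style="color: black"><xsl:value-of select="name"/></span>
                </xsl:otherwise>
              </xsl:choose>
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The output is ordinary HTML with inline CSS, which is the sense in which the two style languages work together.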

That should give you some feel for the types of data manipulation possible with XML. Note that the possibilities are highly dependent both on the markup and on the degree to which available browsers support the various features of XML, CSS, and XSL. Note also that an XML file is, like HTML, nothing but a simple text file. The simplicity and small size of the text format make it an ideal medium for sharing data, with or without a browser. In chemistry, for instance, Peter Murray-Rust has spearheaded the development of CML (Chemical Markup Language), a specialized XML DTD, and has even developed a custom browser for it, called Jumbo. It is my hope that a similar DTD may be developed within the systematics community.

For another example of the use of XML from the UNC Herbarium, click here.

©1999 Ron Gilmour
Comments and suggestions are encouraged!
Email the author.