Making XML Documents Searchable through the Web

Dongwook Shin, National Library of Medicine

ABSTRACT

As the number of XML documents available on the Internet grows rapidly, search tools that let people find relevant content are becoming indispensable. XML search engines could work like the popular World Wide Web search engines, most of which simply retrieve whole HTML documents. However, it is highly desirable to provide structured search capabilities that let people retrieve any part of the content, whether it is a whole document or a subpart of one.

In this paper, we introduce XRS (XML Retrieval System), which performs structural search and renders the retrieved content in HTML. XRS uses two recently developed techniques. The first is BUS (Bottom Up Scheme), a technique for indexing and retrieving structured documents efficiently [1]. BUS indexes only the leaf elements of a DTD structure; the index information for an intermediate element is computed at retrieval time by accumulating that of the leaf elements nested within it. This lets a user compose structural queries based on the DTD structure more flexibly than other Web search engines allow, and still get search results quickly.
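The bottom-up idea can be illustrated with a minimal Java sketch. It is not the BUS implementation described in [1]: the class name, the integer element identifiers, the explicit child-to-parent map, and the use of raw term frequency as the index information are all assumptions made for illustration. Postings are stored for leaf elements only, and the value for an intermediate element is accumulated from its nested leaves at query time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: leaf-level postings with bottom-up accumulation at query time. */
final class BottomUpIndexSketch {
    // Hypothetical element tree: child element id -> parent element id (the root has no entry).
    private final Map<Integer, Integer> parentOf = new HashMap<>();
    // Postings exist for leaf elements only: term -> (leaf element id -> term frequency).
    private final Map<String, Map<Integer, Integer>> leafPostings = new HashMap<>();

    /** Registers an element together with its parent element. */
    void addElement(int id, int parentId) {
        parentOf.put(id, parentId);
    }

    /** Indexes the text of a leaf element; intermediate elements are never indexed. */
    void indexLeaf(int leafId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            leafPostings.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(leafId, 1, Integer::sum);
        }
    }

    /**
     * Returns element id -> frequency of the term for every leaf that contains it
     * and for every ancestor of such a leaf. Ancestor values are computed here,
     * at retrieval time, by summing the values of the nested leaves.
     */
    Map<Integer, Integer> search(String term) {
        Map<Integer, Integer> frequencies = new HashMap<>();
        Map<Integer, Integer> postings =
                leafPostings.getOrDefault(term.toLowerCase(), Collections.emptyMap());
        for (Map.Entry<Integer, Integer> posting : postings.entrySet()) {
            Integer element = posting.getKey();
            while (element != null) {            // walk from the leaf up to the root
                frequencies.merge(element, posting.getValue(), Integer::sum);
                element = parentOf.get(element);
            }
        }
        return frequencies;
    }
}
```

In this sketch, a query can target any level of the element tree even though only leaves were indexed. For example, if a title leaf contains "XML" once and a sibling paragraph leaf contains it twice, search("xml") reports 1 and 2 for the two leaves and 3 for the section that encloses them.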

Second, XRS uses a Java component that renders XML output as HTML, so that a user without an XML-enabled browser can still view the search results returned by the search engine. The rendering component is wrapped in the Query Mediator servlet, which resides on the server and mediates between the user and the back-end search engine. If a search result is in XML, for instance an XML element itself, the Query Mediator servlet passes it through the rendering component. Otherwise, it sends the result directly back to the GUI.
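The mediation step can be sketched as follows. This is a hedged illustration, not the XRS rendering component: the class and method names are hypothetical, the servlet plumbing is omitted, the test for "is XML" is a simple assumed heuristic (the result starts with "<"), and the rendering is done with a small XSLT stylesheet through the standard javax.xml.transform API, a technique the abstract does not specify.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative only: route XML results through a renderer, pass other results through. */
final class QueryMediatorSketch {
    // Assumed stylesheet: wraps every XML element in a <div> named after the element.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='*'>"
        + "<div class='{name()}'><xsl:apply-templates/></div>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Returns HTML for XML results and the unchanged text for everything else. */
    static String mediate(String searchResult) {
        return looksLikeXml(searchResult) ? renderToHtml(searchResult) : searchResult;
    }

    /** Assumed heuristic: a result whose first non-blank character is '<' is treated as XML. */
    static boolean looksLikeXml(String text) {
        return text.trim().startsWith("<");
    }

    /** Applies the stylesheet above to an XML fragment and returns the HTML as a string. */
    static String renderToHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not render XML result", e);
        }
    }
}
```

Keeping the branch in a single mediate method mirrors the behavior described above: a result such as a plain hit count is returned untouched, while a retrieved XML element is converted before it reaches the GUI.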

In XRS, the GUI is a Java applet that communicates with the Query Mediator servlet over HTTP. The servlet interacts through a socket with the back-end search engine, most of whose functions are written in native C code.

References

[1] Dongwook Shin et al., "BUS: An Effective Indexing and Retrieval Scheme in Structured Documents," Digital Libraries '98, 1998, pp. 235-243.