Digital Libraries Research Agenda Report

A "Strawman" Report

for the

IITA Digital Libraries Workshop

Hector Garcia-Molina

This document was developed -- in the form of a workshop report -- prior to the Workshop, as a means of focusing discussion and providing some positions to which the attendees might react.

Starting Point: NII Report

On February 28 and March 1, 1994, various organizations (including the Computing Research Association, the Council on Competitiveness, and the Cross Industry Working Team) sponsored a comprehensive workshop on the research and infrastructure needs for the emerging National Information Infrastructure (NII). The report from that workshop (available from EDUCOM at nii-forum@educom.edu) discusses in great detail the principal challenges and makes extensive research and development recommendations. Since digital libraries are an important component of the NII and share many of the same challenges, we do not wish here to redevelop the same type of comprehensive list of research issues. Instead, we will simply summarize the main findings of the NII report, refer the reader to the full report for additional information, and focus here on highlighting the differences and the needed library-specific priorities.

According to the NII report, the NII will support advanced applications by providing: (a) thousands of information repositories, (b) wide bandwidth data networks and information appliances, and (c) advanced communications and information access services.

The report identifies critical technical challenges in the following areas:

(1) Network components that can handle voice, video, and text simultaneously, and can operate seamlessly.

(2) Information appliances and services that can provide access and services in a scalable, efficient, and interoperable way.

(3) Information access techniques that can enable efficient searches of large distributed information repositories, making the myriad of information resources understandable.

(4) Multimedia information technologies that can, for example, synchronize and integrate real-time delivery of voice and video, and can support search and retrieval based on image content.

(5) Infrastructure for application development that can provide common solutions.

(6) Technologies that are dependable and manageable.

(7) Technologies that are easy to use and services that are accessible by users with widely varying skills, experiences, abilities, and backgrounds.

(8) Interoperability among heterogeneous systems, which will be required on an unprecedented scale.

(9) Security and privacy technologies that are easy to use and provide appropriate levels of security to suit the requirements, cost constraints, and convenience of the end user.

(10) Technologies and services that provide portability, mobility, and ubiquity.

The NII Report goes on to say that the Federal Government has at least two roles to play in the development of the NII:

Devise and implement effective policies that enable the development of a coherent infrastructure, while allowing competitive market forces to drive the creation of products and services;
Foster and support a long-range research program that addresses the many technical problems.

In addition, the report states that NII research should be guided by pilot projects to develop appropriate technologies, evaluate them, and get them into the hands of users.

What are Digital Libraries?

Digital libraries provide the critical information management technology for the NII, and at the same time represent its primary information and knowledge repositories. In other words, digital libraries are the core of the NII. The information services, search facilities, and multimedia technologies of items (2), (3), and (4) above constitute the digital libraries technologies. Like other NII technologies, they must provide for dependability, manageability, ease of use, interoperability, and security and privacy (items (6), (7), (8), and (9)). The information repositories mentioned in (a) at the beginning of the previous section constitute the contents of the digital libraries.

Notice that this new notion of a "library" is broader than the traditional view. In particular, information does not have to be processed by a human (e.g., catalogued, approved, edited) before it can become part of the library. Nevertheless, we expect that there will be some repositories with controlled collections.

Also notice we are using the plural term "digital libraries." We do not expect to see a single digital library in the NII. Each information repository will be managed separately, possibly with different technologies, and hence each will constitute a digital library. However, it will be possible (and actually critical) to integrate separate libraries "virtually" into a single one, by providing a software layer on top of the libraries.
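
The idea of a software layer on top of separately managed libraries can be illustrated with a small sketch. The Python below is a hypothetical illustration only, not a design proposed in this report: the class names (KeywordLibrary, RecordLibrary, VirtualLibrary) and the search method are assumptions made for the example. Each repository keeps its own internal representation, and a thin layer forwards a request to every repository and merges the answers.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordLibrary:
    """Hypothetical repository that stores documents as title -> full text."""
    name: str
    docs: dict = field(default_factory=dict)

    def search(self, term):
        # Match the term anywhere in the full text, ignoring case.
        term = term.lower()
        return [title for title, text in self.docs.items()
                if term in text.lower()]


@dataclass
class RecordLibrary:
    """Hypothetical repository that stores catalogue records with subjects."""
    name: str
    records: list = field(default_factory=list)

    def search(self, term):
        # Match the term against the subject headings only, ignoring case.
        term = term.lower()
        return [r["title"] for r in self.records
                if term in (s.lower() for s in r["subjects"])]


class VirtualLibrary:
    """Assumed software layer that presents several libraries as one."""

    def __init__(self, libraries):
        self.libraries = list(libraries)

    def search(self, term):
        # Forward the request to every library and tag each hit
        # with the name of the library it came from.
        hits = []
        for lib in self.libraries:
            for title in lib.search(term):
                hits.append((lib.name, title))
        return sorted(hits)
```

In this sketch the two repositories share only a method name; a real integration layer would also have to translate between query languages and result formats, which is the heterogeneity problem discussed later in this report.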

Digital Libraries Research Agenda

Given our definition, the research agenda for digital libraries is the research agenda of the NII. However, it is useful to reorganize the research topics listed above to highlight the critical library-specific issues. Thus, we propose the following classification of digital libraries research and development problems.

Conceptually, there is a single problem to be addressed by digital libraries -- that of information discovery. How do we put a user "in touch" with the information that is of interest to him? All other problems are subsets. We may have to pay to get the user his information (item DL8 below); we may have to scan in the information before we can provide it (item DL6 below); or we may have to search through heterogeneous repositories (item DL4 below). And, incidentally, the information that a user wants may not be in a repository, but may have to be created by a service. But at the highest level, our goal is "information matchmaking."

To provide users with the information they want, we need to address the following interrelated research problems:

(DL1) Understanding user needs. Before we can find the information, we need to know what the user wants. We need to develop expressive query languages and user interfaces that allow a user to describe, naturally and accurately, his information needs.

(DL2) Resource discovery. An initial step in the actual matchmaking process is to find out what digital libraries are available and may have relevant information or services. The challenge is to characterize the information contents (e.g., meta-information) and service capabilities of libraries in a compact and meaningful way.

(DL3) Information retrieval. If the desired information is in one or more repositories, we have to find it efficiently, without also retrieving irrelevant information, and without missing anything relevant. To do this, we need mechanisms for identifying the relevance of information to a given user request, as well as access structures to perform the identification and retrieval efficiently.

(DL4) Heterogeneity. Information will be stored or provided by digital libraries using different commands, and will be returned using different representations. Standardized commands, protocols, and models will help, but we expect that there will always be a significant level of heterogeneity. Thus, we need to develop technology for interoperation between digital libraries that will allow searches and interactions to span multiple libraries.

(DL5) Scale and distribution. We expect dramatic growth in the number of digital libraries, the volume of information in them, the number of users, and the number of requests. We need to develop technologies (for the rest of the items addressed in this list) that will scale and that will work efficiently in spite of the wide geographic distribution (and possible temporary disconnection) of the information resources.

(DL6) Information input and collection building. Clearly, the information in the digital libraries must enter the system somehow. Some of it will come from conventional media such as printed documents or videotape. We need to develop mechanisms for digitizing this information accurately and efficiently. Other sources of information will be the library users themselves, and hence we also need natural and easy-to-use mechanisms for generating new information, as well as for annotating or modifying existing information.

(DL7) Preservation. Some of the digital information needs to be preserved for future generations. The key challenge is to ensure that at least one copy of the medium that holds the information (e.g., the tape or CD-ROM) physically survives, that that medium can be read in the future, and that the digital information can be interpreted properly.

(DL8) Security, privacy, and charging. Before providers will make their information available, they need to be assured that they will be compensated economically, if so desired, and that the information will not be accessible to unauthorized users. We need to develop schemes for protecting information without unduly interfering with desired information sharing. We also need mechanisms for tracking access and charging for it, in a way that encourages providers to make even more useful information available.

Research Priorities

We believe that all of the research problems of the previous section are important and must be addressed. However, from a short- to medium-term perspective (e.g., 5 years), we believe there are some problems that are more critical to the potential of digital libraries, and to the continued support and interest from funding agencies and the public in general.

Currently, commercial information vendors such as Knight-Ridder's Dialog and Mead Data provide a basic but very useful level of functionality over significant collections. For example, Dialog provides access to over 400 databases, many of which are full text. These vendors also provide some simple but powerful resource discovery tools for identifying relevant databases in their collections.

Most critical. We believe that it is critical to provide at least this level of functionality (e.g., boolean queries) over heterogeneous collections, geographically distributed at several organizations, with some level of resource discovery. This will let us demonstrate that investments in digital libraries will be sharable across organizations. At the same time, it will let us offer a useful service. Even though the offered user interfaces, query languages, and so on, would be limited, we know from experience with the commercial vendors that they can provide an extremely useful service. For this work, the critical research problems that need to be addressed are those of items DL2 and DL4.

Next most critical. Next in priority (short to medium term) are scalability (DL5) and security and charging (DL8) problems. Again, in terms of rapidly demonstrating the potential of digital libraries, it is important to provide access to valuable and useful information, which requires security and charging mechanisms. Similarly, we need to demonstrate access to significant volumes of information. Current commercial systems are already struggling with scalability problems, so if we want to go beyond the sizes of their collections, it is important to develop scalable mechanisms.

Not critical. The rest of the research topics are not on our critical list. This does not mean they are not important. They are, and it is essential to continue research on those issues. However, we believe that the short-term payoff from being able to, say, scan more documents (beyond what is already being done commercially), or to provide slightly better precision and recall (over what current information retrieval techniques achieve) is lower than the payoff from solving the other problems.

Infrastructure

A single organization can build a single digital library. However, to share information across libraries, it is important to have a common infrastructure that facilitates such sharing. Furthermore, this same infrastructure can also support sharing of technologies used to build the digital libraries.

In our view, the infrastructure for digital libraries should have the following components:

(IN1) Shared information representation models, service representation models, and access protocols. These will facilitate the sharing of information and services across digital libraries.

(IN2) Information "content" sharing agreements. These will take the form of communities of organizations that agree to share their collections. Initially, the sharing may be free, but eventually the communities will institute common charging schemes. The communities will also provide rules for having additional members join.

(IN3) Resource directories. To facilitate resource discovery, the infrastructure should provide "directories" that describe available information resources and the models and protocols they follow, and characterize their contents. Similarly, technology directories could be provided to help in sharing of developed technologies.

(IN4) Coordination forum. The goal of this forum is to coordinate national research and development activities. It could provide help in organizing workshops, conferences, and newsletters, whose goal would be to define further the national digital libraries infrastructure. It could also provide a mechanism for circulating and commenting on proposed "standards," similar to the RFC mechanism.
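
As an illustration of the resource directories in item IN3, the following Python sketch shows one possible shape for a directory entry and a discovery lookup. It is a minimal sketch under stated assumptions: the names DirectoryEntry, ResourceDirectory, register, and discover are hypothetical, the protocol labels are placeholders, and characterizing contents by a short list of topic terms is only one of many possible choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    """Hypothetical description of one information resource."""
    name: str
    protocol: str        # placeholder label for the access protocol it follows
    topics: frozenset    # compact characterization of its contents


class ResourceDirectory:
    """Assumed directory that maps query terms to candidate libraries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def discover(self, query_terms):
        # Rank resources by how many query terms appear in their topic list.
        wanted = {t.lower() for t in query_terms}
        scored = []
        for entry in self._entries.values():
            overlap = len(wanted & {t.lower() for t in entry.topics})
            if overlap:
                scored.append((overlap, entry.name))
        # Largest overlap first; ties are broken alphabetically by name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored]
```

A directory of this kind holds only meta-information, so it can be replicated or distributed without moving the collections themselves.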

It is important to understand that the digital libraries infrastructure is neither centralized nor a single entity. It is a collection of agreements and distributed (or replicated) meta-knowledge repositories that support digital libraries research and development.

The infrastructure is also not a "pilot project" as described in the NII Report. The pilot projects will build some of the initial digital libraries (or will integrate collections of libraries) aided by the libraries infrastructure. Clearly, it is very important to have such pilot projects to demonstrate potential and achieved results.

Since the digital libraries infrastructure plays such an important role, we believe it is essential to put its initial components in place as soon as possible, say within the next two years.

Evaluation

Many of the national challenges have clearly quantifiable goals, and this makes it easy to evaluate progress. For example, one can measure the number of instructions per second required in a new processor design, or the plasma temperature needed for nuclear fusion. With digital libraries, we are not as fortunate.

There are of course some metrics that can be used, but we think their use is limited. For instance, one can count the number of documents scanned into a library, or the number of queries run at a server. But the raw number of documents or queries does not reflect the quality of the information. Furthermore, it is not easy to come up with an overall meaningful target for those metrics.

Traditional information retrieval metrics such as precision and recall also are limited when applied to huge heterogeneous collections of information. In particular, it is difficult to say what the representative queries are, and it is hard to know how many documents were missed by a particular search. (It could take a lifetime to examine manually all reachable documents to see which were actually relevant.)
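
For reference, the two metrics can be computed as in the short Python sketch below, which uses the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved). The function name and the document identifiers in the usage note are assumptions made for the example. Note that the recall computation needs the complete set of relevant documents as input, which is exactly what is hard to obtain for a huge heterogeneous collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the search returned.
    relevant:  identifiers of all documents judged relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a search returns four documents of which two are relevant, and three relevant documents exist in total, the call precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]) gives a precision of 2/4 and a recall of 2/3.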

So, even though the traditional metrics will be useful in some cases, we will not be able to rely on them to evaluate progress in digital libraries. In a way, the situation is analogous to the World Wide Web (WWW) a few years ago. At that time, it would have been impossible to predict how successful the WWW would be, and how it would be used. A formal evaluation of the WWW a few years back could have easily concluded that it was not useful, for example, in terms of number of queries run, or in terms of how useful the documents were to a particular search.

Given the limitations of traditional metrics, we believe that the most promising approach is to evaluate informally what users are doing, and what services they want. This could involve informal interviews or feedback sessions with researchers implementing new services or libraries.