Web-scale Discovery Services: The Next Step in the Evolution
of Improving Access to Library Collections

By William Miller

The latest development in helping users to search for information, “web-scale discovery services,” represents a significant improvement in access over the online catalog alone, and over the multiplicity of indexes and databases which libraries have been asking users to search separately or in small groups via “federated searching.” These services are offered principally by Serials Solutions, EBSCO, OCLC, and ExLibris, as well as by several other vendors, under a variety of names. They all look similar to the end-user, although the internal workings may differ considerably. What the user sees at first is a Google-like single search box (see illustration, next page), in which a search can be entered as it would be in Google, without recourse to a controlled vocabulary or list of standard subject headings. Instead of searching the free sites on the internet, however, users are searching the library’s carefully acquired (and expensive) holdings in all formats, along with high-quality indexes and abstracts, all in the same search. In most cases, a few clicks then take them directly to the electronic full text.

These web-scale discovery services are best thought of as a giant index which often (though not always) offers full-text access to quality resources. We need not dwell here on the spotty, uncertain value of searching using Google and its cousins; there is some gold to be found on the free, open web, but dross abounds, especially when looking for academic resources. Even so, users have voted with their feet (or rather, with their fingers, in this case), to the detriment of comprehensive searching of peer-reviewed, high-quality resources. The rise of Google and other internet search engines has escalated users’ expectations for comprehensive, immediate access to the full text of resources, without a concomitant concern for quality of results.

Web-scale discovery services are one answer to this problem. One of the greatest frustrations for libraries is that typically only one or two percent of their users begin a search within the library’s resources; most start with Google, and never get beyond it. As a result, libraries lose the battle for their users’ attention before it even begins, users lose the opportunity to tap the library’s resources, and institutions waste the money they have spent on proprietary resources that go unsearched and unused. Web-scale discovery systems present a Google-like appearance, but they turn up results of much higher average quality, and they leverage the investment which the library has already made in subscribing to and purchasing academic resources.

What is in a Web-scale Discovery System?

Pre-indexed collections of journal articles and other resources are the basis for web-scale discovery systems. These can number perhaps half a billion discrete electronic items. Libraries will want to know how many items and which periodicals are encompassed by the pre-harvested, centralized index. These articles could never be individually cataloged, so they are simply not discoverable through a search of the library’s own online catalog (typically, only the titles of the journals themselves are cataloged, and not the articles within them). Libraries which subscribe to the journals or packages of resources included in the discovery system can make the entire corpus of articles or other material searchable at one time. Gone are the days when someone would search Psych Abstracts, for instance, without searching other indexes and abstracts, and thereby miss relevant articles from other fields, even though the library pays for access to those other resources. This comprehensive search access continues to improve over time, as the web-scale discovery service providers sign agreements with more publishers and database creators, and add their collections to the corpus.

Full-text access will vary depending on the resources a library has paid for. For instance, a library which subscribes to Web of Science will have searchability and full-text access to the contents of 10,000+ journals, including open access titles and 110,000 conference proceedings. None of those items (with the exception, probably, of a handful of conference proceedings) would have been individually cataloged and discoverable in the library’s online catalog, even though the library is paying for the materials. Libraries subscribing to LION (Literature Online) and to a web-scale discovery system that indexes it will gain enhanced discoverability of the full text of more than 300,000 works of poetry, prose, and drama in English, along with online literary criticism, a treasure trove largely inaccessible through the catalog.

The library catalog. Of course, we do want people to search the holdings in the catalog also. However, we do not want to make that a separate, additional step. Libraries can have their cataloged information loaded into their discovery service, thereby making the book collection just as “findable” as journal articles, all in one search (and e-books become immediately available in full text as a result). In the same vein, libraries can load the metadata for a variety of other resources, from government documents to dissertations, archival collections, and other materials which the library might have taken the trouble to digitize, such as campus lectures and cultural performances. As a result, the discovery tool truly becomes, in effect, a fairly comprehensive one-step search of the total library collection, and enables users to be much more efficient and effective in their searching.


Effective Use of a Web-scale Discovery Service

Like a Google search, however, an unfocused or very broad search in a library web-scale discovery service will turn up an impossibly high number of results, and must be refined to be useful, unless one is searching for a known item with complete information, or is simply looking to turn up exploratory results. As an example, I searched for an article that I wrote in May of 1984, entitled “What’s Wrong with Reference?” and published in American Libraries, volume 15, number 5. This article exists in the databases Academic OneFile and Academic Search Premier, as well as in the JSTOR Arts & Sciences VI Archive Collection and Library Literature & Information Science Full Text, so it should be discoverable. However, finding it proved next to impossible without the full author name and title. Entering “miller what’s wrong” (typically the amount of information that someone with a vague recollection of the author and title might remember) in my own library’s Serials Solutions Summon web-scale search system (which we have named “SearchWise”) produced 6,047,590 results, all but 2 or 3 of which were irrelevant to my search, partly because “Miller” is the sixth-most-common last name in America.

Discovery systems, however, do offer considerable opportunity to narrow a search, by date, language, peer-reviewed or scholarly publication (as defined by Ulrich’s guide to periodicals), and subject terminology. Format limiting is also important. One can limit a search to items available online in full text. Using a list of “facets,” one can exclude various classes of material (book reviews, newspaper articles, trade publications), and one can specifically include one kind of format while excluding others. One can also limit to items in the library’s catalog only, or broaden the search considerably to the entire corpus of the search service, discovering items which the user’s library does not own or have immediate access to. All of this takes some familiarity with the system, and librarians can help users refine their searches to avoid frustration and enhance the chances of success.
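The narrowing described above can be pictured as a series of filters applied to a result list. The following is only a minimal sketch in Python, not the method of any actual discovery product: the `Record` fields, the `apply_facets` function, and the sample records are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical, much simplified search result."""
    title: str
    content_type: str        # e.g. "journal article", "book review"
    year: int
    peer_reviewed: bool
    full_text_online: bool


def apply_facets(records, *, exclude_types=(), include_types=(),
                 full_text_only=False, peer_reviewed_only=False,
                 year_range=None):
    """Return only the records that survive every selected facet."""
    excluded = {t.lower() for t in exclude_types}
    included = {t.lower() for t in include_types}
    kept = []
    for record in records:
        ctype = record.content_type.lower()
        if ctype in excluded:
            continue                      # class of material excluded
        if included and ctype not in included:
            continue                      # not one of the chosen formats
        if full_text_only and not record.full_text_online:
            continue
        if peer_reviewed_only and not record.peer_reviewed:
            continue
        if year_range is not None:
            first, last = year_range
            if not (first <= record.year <= last):
                continue
        kept.append(record)
    return kept


# Invented sample data, for illustration only.
SAMPLE_RECORDS = [
    Record("Sample article A", "journal article", 1984, False, True),
    Record("Sample review B", "book review", 1984, False, True),
    Record("Sample article C", "journal article", 2009, True, False),
    Record("Sample news item D", "newspaper article", 2011, False, True),
    Record("Sample article E", "journal article", 2012, True, True),
]

if __name__ == "__main__":
    narrowed = apply_facets(
        SAMPLE_RECORDS,
        exclude_types=["book review", "newspaper article"],
        full_text_only=True,
    )
    for record in narrowed:
        print(record.title)
```

Each facet only ever removes records, so the order in which a user selects them does not change the final list. That is part of what makes facets forgiving for inexperienced searchers.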

Considerations when Choosing a Discovery Search Service

Cost. Cost is obviously a concern with any new major acquisition, and institutions will have to decide whether the improved access is worth paying for, given that the annual subscription cost might displace some content such as journal subscriptions, book purchases, or an electronic database. There are set-up costs, annual subscription costs, and the cost of continuing data loading, maintenance, and troubleshooting. As libraries add content or drop it, the index must reflect the state of the library’s holdings. Once users experience and become used to web-scale searching, it will not be practical to retreat from this level of service, so this subscription is a permanent new commitment, if not to a specific search service, then to the continued access afforded by one of the other competing search service products.

Content to Match Collection. The content indexed in the central repository is important, and will vary from service to service, depending on which publishers the particular service has been able to reach agreements with. Discovery services have to sign agreements with individual publishers and database creators in order to include their content, and the array of agreements differs from service to service (of course, these are constantly growing and changing also). These agreements govern how accessible the content of the databases will be.

Libraries need to consider how well what is indexed by a particular service matches up to the subscription content and academic programs of their given institution. Is it relevant to your library that the central index includes coverage of the Journal of Bone and Joint Surgery or the journals, working papers, and reports contained within GIGA, the German Institute of Global and Area Studies, or the Chongqing VIP Chinese Science and Technology Database, a massive collection of more than 2 million Chinese newspaper and journal articles, along with articles from 5,000 foreign journal titles? If so, the discovery service that covers this content and allows your users to get at the full text you are subscribing to becomes one that is more valuable to you than those that do not. If an institution does not have programs needing this indexing, then the coverage is largely irrelevant, and might even be considered undesirable.
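One way to make this comparison concrete is to measure what share of a library’s subscribed titles each candidate service indexes. The sketch below, in Python, assumes that the library can export a list of title identifiers for its own subscriptions and obtain a comparable list from each vendor. The function name, the service labels, and the identifiers are hypothetical placeholders, not real vendor data.

```python
def coverage_report(subscribed, indexed_by_service):
    """For each service, report how much of the subscribed list it indexes.

    subscribed:          iterable of title identifiers the library pays for
    indexed_by_service:  dict mapping a service label to the identifiers
                         that service says it indexes
    """
    subscribed = set(subscribed)
    report = {}
    for service, indexed in indexed_by_service.items():
        covered = subscribed & set(indexed)
        share = len(covered) / len(subscribed) if subscribed else 0.0
        report[service] = {
            "covered": len(covered),
            "share": share,
            "missing": sorted(subscribed - covered),
        }
    return report


if __name__ == "__main__":
    # Placeholder identifiers; a real comparison would use ISSNs or similar.
    library_titles = {"title-001", "title-002", "title-003", "title-004"}
    vendor_lists = {
        "Service A": {"title-001", "title-002", "title-003", "title-900"},
        "Service B": {"title-001"},
    }
    for service, row in coverage_report(library_titles, vendor_lists).items():
        print(service, f"{row['share']:.0%}", "missing:", row["missing"])
```

Titles that a service indexes but the library does not subscribe to (such as the placeholder `title-900`) are deliberately ignored here, since coverage the institution has no use for adds nothing to the match.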

Depth of Services. Also worthy of consideration is the depth of the abstracting and indexing services included within the search service. Some search services include table of contents information from publishers or from other sources, subject headings from a variety of sources, and indexing of the full text of items, supplied by the content providers themselves. Some services allow libraries to add to their profiles the indexing and abstracting services to which they subscribe, to maximize access, although this access will provide direct linking only to the items which the library owns. Some services can offer “rich indexing,” which indexes the entire full text, rather than simply subject headings and other search fields.

Different Vendors. A related consideration concerns which vendors a library is using for some or all of its full-text content. If a library is heavily dependent on databases owned by or mounted on the EBSCOHost platform, then subscribing to EDS, the EBSCO Discovery Service, will be attractive because certain barriers to moving from the indexing to direct access will be minimized or eliminated. If a library already has its journal holdings maintained in the Serials Solutions knowledge base, and is a heavy subscriber to Proquest databases (Proquest being the parent company of Serials Solutions), then the spadework necessary to link up the holdings with Serials Solutions’ Summon system will be minimized. If a library is using ExLibris’ or Innovative Interfaces’ online catalog, then those companies’ discovery systems become more desirable and easier to implement.

Searching Algorithms. The searching algorithms of the systems differ, and the differences may be important. Most users will not spend much time looking beyond the first page of search results, so relevancy ranking is important. Test searches will reveal which system produces the most satisfactory results. For instance, would a search for a known item with full author and title information place that item at the top of the results page, or would it be buried within a sea of false drops, many pages down? Are full-text items ranked higher than references to items not owned by the library? Does the service offer “direct linking” to the full text of material to which the library subscribes, or does the service offer only OpenURL link-resolver access, which requires additional steps to get to the full text of important content? Several services also offer direct linking to e-books owned by the library, allowing users to bypass the step of being directed first to the library catalog for access to such items.
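A known-item test of this kind is easy to score once the ranked titles from a trial search have been copied out of each system. The Python sketch below is one possible way to do so, under stated assumptions: the function names are hypothetical, the page size of 10 is an assumed default rather than a property of any particular service, and matching is done on the title alone after lowercasing and stripping ASCII punctuation.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCTUATION).split())


def known_item_rank(result_titles, known_title):
    """Return the 1-based position of the known item, or None if absent."""
    target = normalize(known_title)
    for position, title in enumerate(result_titles, start=1):
        if normalize(title) == target:
            return position
    return None


def on_first_page(rank, page_size=10):
    """True when the item was found and falls within the first page."""
    return rank is not None and rank <= page_size


if __name__ == "__main__":
    # Invented result list standing in for one system's trial search.
    trial_results = [
        "Wrong Turns in Reference Work",
        "What's Wrong with Reference?",
        "Reference Services Today",
    ]
    rank = known_item_rank(trial_results, "What's wrong with reference")
    print("rank:", rank, "first page:", on_first_page(rank))
```

Running the same handful of known-item searches against each candidate system, and recording how many land on the first page, gives a rough but comparable picture of relevancy ranking before a purchase decision is made.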

Effect of Increased Accessibility. Another concern is the effect increased accessibility will have on a library. Libraries may choose to include in the web-scale discovery service indexing and abstracting information for items which they do not own or have subscription access to. Searching such information will turn up more potentially useful items, but it will also mean more interlibrary loan activity, and potential frustration for users who turn up relevant items which their library does not own or subscribe to.

Libraries need to decide how much of that indexing-only information they really want to offer. For instance, there are several million open access full-text online books available in the Hathi Trust database, and it makes sense to offer the indexing to that free content, but there are many more millions of items in Hathi Trust which are not available without subscription costs or interlibrary loan costs, and libraries will have to decide whether or not to include indexing to this material also, if they choose not to be full members of the Hathi Trust.

The impact of choosing a web-scale discovery service on reference and instructional activity will be considerable. Libraries will need to reorient their instructional efforts to help people make maximum use of this new way of accessing resources, including training people in how to construct and hone their searches using the system’s facets for maximum success.

A public relations effort will be necessary to familiarize people with this new tool. A process of outreach will also need to take place with users, and in particular with faculty, who may object to this megasearch approach, preferring simply to search the one database with which they have been familiar and comfortable thus far in their careers.

Cautionary Notes

Providing a seemingly comprehensive point of access can reinforce a user’s assumption that he or she is indeed searching everything a library has, or has access to, when in fact that may not be the case. Items which are not in electronic form, are not yet cataloged, or have metadata too sparse to make them discoverable will not appear in the results, nor will subject guides and lists of recommended resources created by the library, unless special efforts are taken to include these resources. Search services are increasingly making it possible to include LibGuides and other resources as searchable items. EBSCO’s EDS and OCLC’s service (a piece of their broader all-encompassing library management system called OCLC WorldShare Management Services) allow libraries to integrate A-Z database and journal lists into their indexes. Of course, libraries can always catalog anything they wish to see included in the discovery service results.

It might occur to some libraries to save money by canceling indexing and abstracting services, because the web-scale discovery service already contains the indexing for the resources they have. This will limit users, however, to the owned/subscribed content only, rather than to the wider universe of resources which users may want to tap through interlibrary loan and other means. Moreover, certain users will still prefer to use a specific index or defined group of indexes in a federated search, as a more targeted approach to their needs. EDS offers an “integrated search” option which allows users to search databases not included in the index, and to limit their searches to selected databases only, rather than encompassing the entire index.

Until libraries can integrate their catalogs with their discovery layers, an undesirable disjunction will remain. As discovery system vendors begin to offer their own online library catalogs/management systems, as several already do and others are on the verge of doing, such integration should be possible. If the two are integrated, users can do things like place holds, save searches, and submit interlibrary loan requests directly for items which turn up in the web-scale search but are not owned by the library. Theoretically, the concept of the catalog itself could disappear, as far as users are concerned, or rather simply be part of a continuum of information and services.

There is still a great need for librarians to influence publishers and content providers to share their metadata with all discovery services, and for discovery service vendors to make the contents of their products even more transparent and interoperable. Libraries can advance this process by adopting acquisitions policies which give preference to vendors who make their metadata available to all web-scale discovery services.

There is, as yet, very little assessment information available regarding the efficacy of web-scale discovery services. Although we are assuming that these have greatly enhanced user access, little of a concrete nature has been published so far, and little is known for certain. Much needs to be done to explore how the search services affect user behavior and enhance (or impede) research. Vendor representations tend to be inflated and cannot be relied upon as factual.

Actualizing the Potential of our Paid-for Resources

Even given all these cautionary notes, and even though we are in the early stages of implementing web-scale discovery services, hundreds of libraries have now adopted these services and are betting that they will revolutionize the user experience. In truth, we have no choice but to compete with Google, Bing, and others by offering our own vetted versions of universal search.

Moreover, only through the successful adoption of web-scale services can we actualize the potential of much of what we are already paying for. This actualization is especially important for materials which the library itself has digitized, and which are not normally discoverable otherwise. Many hundreds of libraries have gone to the trouble of creating an institutional repository of digitized material, consisting of theses and dissertations, faculty papers, university archives, and collections of local history and rare resources. Web-scale search services give us the perfect way to integrate these hard-to-find materials into our panoply of educational resources, and to leverage the enormous amount of work that went into their creation.

Web-scale discovery is the inevitable next step in library services. However, these services are still developing, and institutions must be aware that they are not a panacea. Search results can be overwhelming, and intervention via reference and instruction is needed. Implementation can be labor intensive, and continuing vigilance is necessary to assure that the service is really covering the resources the library is paying for. The questions we ask vendors are critical, but ultimately it will be the questions that libraries ask themselves which lay the foundation for successful implementation of these new comprehensive tools, tools which will allow us to offer much better access to our collections than was heretofore possible.

miller@fau.edu


