SEARCH ENGINE SURVEY
An overview of the mapmakers of cyberspace

Gisle Hannemyr


At the time of writing (July 2000), more than 700 search engines can be accessed through the Internet, most of them by means of a web page that allows users to set up searches and view a ranked result.

This web page is part of a continuing effort to create a survey of some of these search engines (in particular those targeted towards locating information) and the multifarious data repositories and information spaces they allow people to search in.

A search engine is a software system that assists people in locating resources. An important class of search engines are those whose primary function is to search for information. They are by no means the only class: a large (and growing) number of search engines assist people in locating things other than information, such as jobs, services and physical objects (e.g. books, CDs, online auction items, houses, cars and even other people). Such search engines fall outside the scope of this survey.

Basic Concepts

There is a lot of information (or data; see note 1) in the world. Not all of it can be located through the Internet (at least not yet), but a lot of it can now (at least in principle) be found by means of some Internet search engine.

Search engines can be thought of as mapmakers in information space. They explore the landscape of information and create maps in the shape of internal structures that are supposed to help travellers find their way in the chaotic warehouse of superabundant information that most people see when they first are confronted with the World Wide Web.

Not all search engines map the same information space. Some search engines map the resources that can be downloaded from open ftp (file transfer protocol) repositories around the globe, others the resources that are resident on the World Wide Web (technically, this means that they map resources visible through http, the hypertext transfer protocol, since this protocol is what technically defines which part of the Internet landscape belongs to the World Wide Web). A third protocol that also defines a clearly delineated information space is nntp (network news transfer protocol), which technically defines what is known as "network news" (not to be confused with what constitutes "news" in "old media" such as television and periodicals) or, more descriptively, "Usenet discussion groups". To make things even more complicated, some search engines explore information spaces delineated not only by technical protocol, but also by genre. For instance, they may monitor and map breaking news (in the "old media" sense) distributed by wire services and/or news-oriented media on the World Wide Web.

Generally speaking, search engines delimit the information spaces they map by technical protocol, by genre, or by both.
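The protocol-based delimitation can be illustrated with a minimal sketch. It assumes that a resource is identified by a URL and that the URL scheme alone decides which protocol-defined space it belongs to. The mapping table and the function name information_space are illustrative assumptions, not part of any search engine discussed here.

```python
from urllib.parse import urlsplit

# Assumed mapping from URL scheme to the three protocol-defined
# information spaces named in the text (web, usenet, ftp).
SCHEME_TO_SPACE = {
    "http": "web",
    "https": "web",
    "nntp": "usenet",
    "news": "usenet",
    "ftp": "ftp",
}


def information_space(url: str) -> str:
    """Return the protocol-defined space a URL belongs to, or "other"."""
    scheme = urlsplit(url).scheme.lower()
    return SCHEME_TO_SPACE.get(scheme, "other")


if __name__ == "__main__":
    for url in ("http://www.example.org/",
                "ftp://ftp.example.org/pub/readme.txt",
                "news:comp.infosystems.www.misc"):
        print(url, "->", information_space(url))
```

Genre cannot be read off a URL in this way, which is one reason the genre classes are harder to define.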

In the survey, I've indicated this by creating categories accordingly. There are three clear-cut categories corresponding to technical protocol (web/http, usenet/nntp and ftp/ftp). In addition, some search engines provide access to proprietary data from sources outside what is openly available on the Internet.

Genre-oriented classification is much more difficult, as this is subjective. Tentatively, I've created six genre classes (named: wire, legal, science, reports, reference, and directory), but I am always willing to listen to anyone who can propose a good alternative system.

Brief description of the four + six information spaces:

Being aware of which information space a particular engine maps is crucial if you want to make efficient use of the engine. If you are looking for a specific company's home page on the World Wide Web, you have a much higher chance of success if you use a search engine that maps the World Wide Web, rather than one that lets you search the information space made up of newswire telegrams and newspaper articles. On the other hand, if you are looking for breaking news about the Arab-Israeli peace process, there is probably little point in searching the World Wide Web (it is impossible for general search engines to keep up with current affairs on the World Wide Web), and you would be better off using a search service that specialises in monitoring recent news.

To represent the information spaces and other classifications, the following scheme is used to encode the result in an easy-to-read tabular form:

I.S.     Coverage    Access           Hosted
Where    Extensive   Proprietary      NH  Not Hosted
Genre    Partial     Partially open   PH  Partially Hosted
         None        Open             H   Hosted

The I.S. (Information Space) encoding combined with coverage indicates where the search engine finds information, what genre of information it covers, and to what extent it covers that particular domain. There are basically two different types of domains, which I've called Internet and Genre respectively.

The access column indicates what type of access restrictions (if any) there are to the site. An open site allows full, free and unrestricted access to all content. A proprietary site charges a premium, such as a subscription or a PPV (pay per view) fee, for access to content. A partially open site has some free and some premium content.
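As a rough illustration, the encoding described above could be held in a small data structure like the one sketched here. The class and field names are assumptions made for the example, and "ExampleSearch" is a made-up entry, not a site from the survey.

```python
from dataclasses import dataclass, field
from enum import Enum


class Coverage(Enum):
    EXTENSIVE = "extensive"
    PARTIAL = "partial"
    NONE = "none"


class Access(Enum):
    PROPRIETARY = "proprietary"
    PARTIALLY_OPEN = "partially open"
    OPEN = "open"


class Hosted(Enum):
    NOT_HOSTED = "NH"
    PARTIALLY_HOSTED = "PH"
    HOSTED = "H"


@dataclass
class SurveyEntry:
    """One row of a survey table (hypothetical representation)."""
    site: str
    access: Access
    hosted: Hosted
    # Maps an information space name (e.g. "web", "wire") to its coverage.
    coverage: dict = field(default_factory=dict)

    def covers(self, space: str) -> bool:
        """True if the site has partial or extensive coverage of the space."""
        return self.coverage.get(space, Coverage.NONE) is not Coverage.NONE


if __name__ == "__main__":
    entry = SurveyEntry(
        site="ExampleSearch",  # made-up site, for illustration only
        access=Access.PARTIALLY_OPEN,
        hosted=Hosted.PARTIALLY_HOSTED,
        coverage={"web": Coverage.EXTENSIVE, "usenet": Coverage.PARTIAL},
    )
    print(entry.site, entry.hosted.value, entry.covers("web"), entry.covers("ftp"))
```

Spaces that are not listed for a site are treated as having no coverage, which mirrors an empty cell in the tables.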

Survey

General Search Engines

The search engines listed below are fairly general in scope, covering a multitude of genres and sources.

 
Site      Web     Usenet     Ftp     Prop.   Legal    Wire   Science Reports    Ref.       Dir.    Access Hosted Rating
AltaVista1                       PH 5
Deja                        H 5
FTP Search                        NH 5
Google2                       PH 5
Invisible Web3                       NH 1
Northern Light4                       PH 5

Notes:

AltaVista
AltaVista was originally created at Digital Equipment Corporation's research laboratories in Palo Alto as an on-the-web showcase for the 64-bit Alpha CPU developed by DEC (now a part of Compaq). Its presence on the web helped establish search engines as part of the Internet landscape. Following the restructuring of DEC, AltaVista was set up as a separate company and is now a major Internet portal with Internet searching as only one of its many services. Among its special features is a technology called RealNames that checks the search terms entered by the user against an internal database of registered and common-law company, product and concept names and marketing slogans. If RealNames finds a match, it points you to the name-owner's web page. AltaVista also licenses Babelfish from the company Systran. This is useful for quick translations of foreign web pages into (something resembling) English. AltaVista also offers a media finder that lets users search for particular media types (e.g. images, video and audio). This is described below, in the section on media search engines.
Google
Google (earlier name WebBase) started out as a research project at Stanford University. Google uses a simple, yet strikingly efficient, approach to ranking, known as "link cardinality". Google has trademarked this with the name PageRank™ and applied for a patent. What is meant by "link cardinality" is that it counts how many times sites, especially respected and well-known sites such as Yahoo!, have linked to a particular page from elsewhere on the web. This is a variation of the recommender system approach, but instead of explicit voting, Google extracts the implicit "votes" by counting links. Google has another useful feature. Like most web search engines, it provides a hyperlink that the user may click on to go to the information at the place of origin (the non-hosted approach). In addition, it retains a local copy of the page as it was originally downloaded and analyzed (the hosted approach). This local copy can be used both if the original page is no longer available, and if it has changed so much since the time it was analyzed by Google that the relevance ranking of the page no longer applies.
Invisible Web
The Invisible Web consists of searchable information resources whose contents cannot be indexed by traditional search engines. These include databases, archived material, and interactive tools such as calculators and dictionaries. Since these resources are embedded within thousands of individual Web sites, they are not visible to the search engines of today.
Northern Light
In addition to searching open Internet sources, Northern Light hosts a database called the Special Collection, which it describes as "an online business library comprising 6,700 trusted, full-text journals, books, magazines, newswires, and reference sources." Access to items in the Special Collection is on a PPV basis.
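The link-counting idea mentioned in the Google note above can be sketched as follows. This is a simplified inbound-link count with optional per-site weights, written only to illustrate the "implicit votes" notion. It is not Google's actual PageRank computation, and the function name and the weighting scheme are assumptions.

```python
from collections import defaultdict


def rank_by_inbound_links(links, site_weights=None):
    """Rank pages by (optionally weighted) inbound link count.

    links: iterable of (source, target) pairs, one per hyperlink.
    site_weights: optional dict giving extra weight to links from
        respected sources; sources not listed count as 1.0.
    Returns a list of (page, score) pairs, highest score first.
    """
    site_weights = site_weights or {}
    scores = defaultdict(float)
    seen = set()
    for source, target in links:
        # A page cannot vote for itself, and a repeated link counts once.
        if source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        scores[target] += site_weights.get(source, 1.0)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    links = [("directory", "page-x"), ("p", "page-y"), ("q", "page-y")]
    print(rank_by_inbound_links(links))
    print(rank_by_inbound_links(links, site_weights={"directory": 5.0}))
```

With equal weights the page with the most inbound links ranks first. Giving a well-known source a higher weight lets a single link from it outweigh several links from lesser-known pages.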

Metasearch Engines

Metasearch engines forward queries to other search engines, extract the results, perform a re-ranking and present them through a standardized user interface.
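The forward, extract, re-rank and present cycle can be sketched in a few lines. The sketch assumes each underlying engine is available as a callable that returns a ranked list of URLs, and it merges the lists by summing reciprocal ranks. Both the interface and the merging rule are assumptions chosen for illustration, not the method of any particular metasearch engine.

```python
from collections import defaultdict


def metasearch(query, engines, limit=10):
    """Forward a query to several engines and merge their ranked results.

    engines: dict mapping an engine name to a callable that takes the
        query and returns a list of URLs, best match first.
    Returns at most `limit` URLs, re-ranked by summed reciprocal rank.
    """
    scores = defaultdict(float)
    for name, search in engines.items():
        for position, url in enumerate(search(query), start=1):
            # A result ranked first contributes 1, second 1/2, and so on.
            scores[url] += 1.0 / position
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:limit]


if __name__ == "__main__":
    # Stub engines with canned results stand in for real search services.
    engines = {
        "engine-a": lambda q: ["http://a.example/1", "http://shared.example/"],
        "engine-b": lambda q: ["http://shared.example/", "http://b.example/1"],
    }
    for url in metasearch("apple pie", engines):
        print(url)
```

A URL returned by several engines accumulates score from each of them, so agreement between engines pushes a result towards the top.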

The column labeled # shows how many resources the engine submits its request to. In most cases this number is taken from the metasearch engine homepage (as it appeared in July 2000). A question mark in this column means that I've not been able to find out, and a number accompanied by a question mark is my estimate.

 
Site   #    Web     Usenet     Ftp     Prop.   Legal    Wire   Science Reports    Ref.       Dir.    Access Hosted Rating
AskJeeves1 5?                       NH 5
Dogpile2 ?                       NH 5
I.SEE  4                       NH 5
IntelliSeek  24                       NH 5
MetaCrawler3 ?                       NH 5
SavvySearch  700                       NH 5

Notes:

AskJeeves
The AskJeeves site accepts questions in plain English: How do I scan photographs? Where can I find recipes for apple pie? AskJeeves compares your question with its internal knowledge base of questions and answers (compiled by human editors), and finds those most closely matching your question. You are then presented with a list of questions it knows how to answer, and should pick one of these (if it still bears resemblance to your original question). This question is then transformed into a set of appropriate search requests and submitted to a number of search engines as well as (if appropriate) reference works such as the Encyclopædia Britannica.
Go2Net
Both MetaCrawler and Dogpile are part of the Go2Net Network.
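The question-matching step described in the AskJeeves note above can be illustrated with a small sketch. How AskJeeves actually compares questions is not described here, so the word-overlap (Jaccard) similarity used below is purely an assumption, as are the function names.

```python
import re


def tokens(text):
    """Lower-case a question and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def closest_questions(user_question, known_questions, top_n=3):
    """Return the known questions sharing the most words with the input.

    Similarity is the Jaccard index: shared words divided by all
    distinct words in either question. Questions with no shared words
    are left out.
    """
    asked = tokens(user_question)
    scored = []
    for candidate in known_questions:
        words = tokens(candidate)
        union = asked | words
        similarity = len(asked & words) / len(union) if union else 0.0
        if similarity > 0:
            scored.append((similarity, candidate))
    # Best match first, ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [question for _, question in scored[:top_n]]


if __name__ == "__main__":
    knowledge_base = [
        "How do I scan photographs?",
        "Where can I find recipes for apple pie?",
    ]
    print(closest_questions("Where do I find an apple pie recipe?", knowledge_base))
```

The user would then pick one of the returned questions, and that choice (rather than the free-text input) would drive the search requests sent onwards.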

Media Search Engines

 
Site      Web     Usenet     Ftp     Prop.   Legal    Wire   Science Reports    Ref.       Dir.    Access Hosted Rating
AltaVista1                       NH 5
MP3 Search2                       NH 5

Notes:

AltaVista
AltaVista also features a media finder that lets users search by keyword for particular media types, such as images, video or audio files. The search results in a pointer to the media file, and a presentation of derived metadata, such as file format and file size. For images and videos, the results page includes thumbnail illustrations. In addition to searching for media files on the web, AltaVista lets users search for premium media content sold through partners.
MP3 Search
Searches for audio files in the MP3 file format in open ftp archives.

Genre Oriented Search Engines

The genre-oriented search engines are characterised by restricting their information space to a particular genre (such as scientific papers or newswire telegrams). Many of these provide extensive coverage of proprietary sources (often carefully maintained by human ontologists), while others are set up to locate genre-specific resources that are freely available on the World Wide Web. I've removed the Usenet and Ftp categories, since these don't apply in this context.

 
Site      Web     Prop.   Legal    Wire   Science Reports    Ref.       Dir.    Access Hosted Rating
Dialog                    H 5
Froogle                    H 5
Kompass                    H 5
Lexis-Nexis                    H 5

Notes:

Defunct Search Engines

Cora
Cora and its companion service Sara search the World Wide Web for research papers in PostScript format from universities and laboratories. Cora's focus is computer science.
Sara
Sara and its companion service Cora search the World Wide Web for research papers in PostScript format from universities and laboratories. Sara's focus is the field of statistics.

Deprecated Search Spaces

In addition, search engines once existed for the now (mostly) defunct information spaces gopher and WAIS (Wide Area Information Servers). They are listed here with links to the last working web gateways. Don't be surprised if they've stopped working by the time you read this.

Un-catalogued

Web (mainly) Search Engines

  1. All the Web, All the Time
  2. Direct Hit (recommender system)
  3. Excite
  4. Filez search
  5. Findsame
  6. GoTo
  7. HotBot (uses technology from Direct Hit for ranking).
  8. Infoseek
  9. Internet Sleuth
  10. Kvasir
  11. LookSmart
  12. Lycos
  13. NetFind
  14. Netscape Portal
  15. Netscape Search
  16. WebCrawler
  17. Yahoo! (no)
  18. Yahoo! (us)

Genre

  1. Atekst søk (Norwegian)
  2. Company info., etc. (Norwegian, infotorg.no)
  3. Search for science

Find Books

  1. Amazon
  2. Bibsys
  3. Library of Congress

Directory Services

  1. Adresseboken
  2. Telenor Gule Sider
  3. Telenor Telefonkatalog (Uninett)
  4. BigBook
  5. Bigfoot
  6. Four11
  7. GTE SuperPages
  8. InfoSpace: The Ultimate Directory
  9. WhoWhere?

Collections


1) Some researchers, citing Claude Shannon's seminal 1949 paper on "The mathematical theory of communication", like to point out that there is a difference between data and information. Yes, sometimes there is, but in the present context I would argue that the two terms are interchangeable.