SEARCH ENGINE SURVEY
An overview of the mapmakers of cyberspace
Gisle Hannemyr
At the time of writing (July 2000), more than 700 search engines can be accessed through the Internet, most of them by means of a web page that allows users to set up searches and view a ranked result.
This web page is part of a continuing effort to create a survey of some of these search engines (in particular those targeted towards locating information) and the multifarious data repositories and information spaces they allow people to search in.
A search engine is a software system that assists people in locating resources. An important class of search engines are those whose primary function is to search for information. They are by no means the only kind: a large (and growing) number of search engines assist people in locating things other than information, such as jobs, services and physical objects (e.g. books, CDs, online auction items, houses, cars and even other people). Such search engines fall outside the scope of this survey.
There is a lot of information (or data; see note 1) in the world. Not all of it can be located through the Internet (at least not yet), but a lot of it can now (at least in principle) be found by means of some Internet search engine.
Search engines can be thought of as mapmakers in information space. They explore the landscape of information and create maps in the shape of internal structures that are supposed to help travellers find their way in the chaotic warehouse of superabundant information that most people see when they are first confronted with the World Wide Web.
Not all search engines map the same information space. Some search engines map the resources that can be downloaded from open ftp (file transfer protocol) repositories around the globe, others the resources that are resident on the World Wide Web (technically, this means that they map resources visible through http, the hypertext transfer protocol, since this protocol is what technically defines which part of the Internet landscape belongs to the World Wide Web). A third protocol that also defines a clearly delineated information space is nntp (network news transfer protocol), which technically defines what is known as "network news" (not to be confused with what constitutes "news" in "old media" such as television and periodicals) or, more descriptively, "Usenet discussion groups". To make things even more complicated, some search engines explore information spaces delineated not only by technical protocol, but also by genre. For instance, they may monitor and map breaking news (in the "old media" sense) distributed by wire services and/or news-oriented media on the World Wide Web.
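The idea that a protocol delimits an information space can be illustrated with a short Python sketch. It assumes, purely for illustration, that each resource is identified by a URL and that the URL scheme stands in for the protocol. The names `PROTOCOL_SPACES` and `information_space` are hypothetical and not part of the survey.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to the protocol-defined
# information spaces named in the text (an illustrative assumption).
PROTOCOL_SPACES = {
    "http": "web",
    "https": "web",    # assumption: the secure variant of http counts as web
    "ftp": "ftp",
    "nntp": "usenet",
    "news": "usenet",  # assumption: news: URLs also point into Usenet
}


def information_space(url: str) -> str:
    """Return the protocol-defined information space a URL belongs to.

    URLs whose scheme is not in the mapping are reported as "unknown".
    """
    scheme = urlparse(url).scheme.lower()
    return PROTOCOL_SPACES.get(scheme, "unknown")


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "ftp://ftp.example.org/pub/readme.txt",
        "news:comp.infosystems.www.misc",
    ):
        print(url, "->", information_space(url))
```

Genre cannot be read off a URL in this way, which is one reason the genre-based spaces are harder to delimit than the protocol-based ones.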
Generally speaking, search engines delimit the information spaces they map by technical protocol, by genre, or by both.
In the survey, I've indicated this by creating categories accordingly. There are three clear-cut categories corresponding to technical protocol (web/http, usenet/nntp and ftp/ftp). In addition, some search engines provide access to proprietary data from sources outside what is openly available on the Internet.
Genre-oriented classification is much more difficult, as this is subjective. Tentatively, I've created six genre classes (named: wire, legal, science, reports, reference, and directory), but I am always willing to listen to anyone who can propose a good alternative system.
Brief description of the four + six information spaces:
Being aware of which information space a particular engine maps is crucial if you want to make efficient use of the engine. If you are looking for a specific company's home page on the World Wide Web, you have a much higher chance of success if you use a search engine that maps the World Wide Web, rather than one that lets you search the information space made up of newswire telegrams and newspaper articles. On the other hand, if you are looking for breaking news about the Arab-Israeli peace process, there is probably little point in searching the World Wide Web (it is impossible for general search engines to keep up with current affairs on the World Wide Web), and you would be better off using a search service that specialises in monitoring recent news.
To represent the information spaces and other classifications, the following scheme is used to encode the results in an easy-to-read tabular form:
| I.S. | Coverage | Access | Hosted |
|---|---|---|---|
| Where | Extensive | Proprietary | NH: Not Hosted |
| Genre | Partial | Partially open | PH: Partially Hosted |
| | None | Open | H: Hosted |
The I.S. (Information Space) encoding combined with coverage indicates where the search engine finds information, what genre of information it covers, and to what extent it covers that particular domain. There are basically two different types of domains, which I've called Internet and Genre respectively.
The access column indicates what type of access restrictions (if any) there are to the site. An open site allows full, free and unrestricted access to all content. A proprietary site charges a premium, such as a subscription or a PPV (pay per view) fee, for access to content. A partially open site has some free and some premium content.
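As a way of making the encoding concrete, the following Python sketch models one row of the survey as a record. It is an illustration under stated assumptions, not the survey's actual data format: the class and function names are hypothetical, the sample entries `engine-a` and `engine-b` are invented placeholders rather than real ratings, and the Hosted codes are stored as opaque labels.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Coverage(Enum):
    EXTENSIVE = "extensive"
    PARTIAL = "partial"
    NONE = "none"


class Access(Enum):
    PROPRIETARY = "proprietary"
    PARTIALLY_OPEN = "partially open"
    OPEN = "open"


class Hosted(Enum):
    NOT_HOSTED = "NH"
    PARTIALLY_HOSTED = "PH"
    HOSTED = "H"


@dataclass
class SurveyEntry:
    """One row of a survey table (hypothetical representation)."""
    site: str
    access: Access
    hosted: Hosted
    # Coverage per information space, e.g. "web" or "wire".
    # Spaces that are not listed are treated as Coverage.NONE.
    coverage: Dict[str, Coverage] = field(default_factory=dict)

    def covers(self, space: str) -> bool:
        """True if the engine has partial or extensive coverage of the space."""
        return self.coverage.get(space, Coverage.NONE) is not Coverage.NONE


def engines_for(space: str, entries: List[SurveyEntry]) -> List[str]:
    """Names of all engines that map the given information space."""
    return [entry.site for entry in entries if entry.covers(space)]


# Invented placeholder rows, not taken from the survey tables.
SAMPLE_ENTRIES = [
    SurveyEntry("engine-a", Access.OPEN, Hosted.NOT_HOSTED,
                {"web": Coverage.EXTENSIVE, "usenet": Coverage.PARTIAL}),
    SurveyEntry("engine-b", Access.PROPRIETARY, Hosted.HOSTED,
                {"wire": Coverage.EXTENSIVE}),
]

if __name__ == "__main__":
    print(engines_for("wire", SAMPLE_ENTRIES))
```

Looking up engines by information space in this way mirrors the advice given earlier: pick an engine whose coverage includes the space where the wanted resource is likely to live.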
The search engines listed below are fairly general in scope, covering a multitude of genres and sources.
| Site | Web | Usenet | Ftp | Prop. | Legal | Wire | Science | Reports | Ref. | Dir. | Access | Hosted | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AltaVista | 1 | PH | 5 | |||||||||||
| Deja | H | 5 | ||||||||||||
| FTP Search | NH | 5 | ||||||||||||
| 2 | PH | 5 | ||||||||||||
| Invisible Web | 3 | NH | 1 | |||||||||||
| Northern Light | 4 | PH | 5 | |||||||||||
Notes:
Metasearch engines forward queries to other search engines, extract the results, perform a re-ranking and present them through a standardized user interface.
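The forward, extract, re-rank and present cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each underlying engine is modelled as a hypothetical function that returns a ranked list of URLs, and the re-ranking rule (summing reciprocal ranks) is chosen for brevity and is not claimed to be what any of the listed metasearch engines actually does.

```python
from typing import Callable, Dict, List

# Assumption: an engine is a function from a query to a ranked list of URLs.
Engine = Callable[[str], List[str]]


def metasearch(query: str, engines: Dict[str, Engine], limit: int = 10) -> List[str]:
    """Forward the query to every engine and re-rank the combined results."""
    scores: Dict[str, float] = {}
    for engine in engines.values():
        # Forward the query and extract the engine's ranked result list.
        for rank, url in enumerate(engine(query), start=1):
            # Reciprocal-rank scoring: a URL ranked highly by several
            # engines accumulates the largest score.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically so the
    # output is deterministic.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:limit]


# Hypothetical stand-ins for real search engines.
def engine_one(query: str) -> List[str]:
    return ["http://a.example/", "http://b.example/"]


def engine_two(query: str) -> List[str]:
    return ["http://b.example/", "http://c.example/"]


if __name__ == "__main__":
    engines = {"one": engine_one, "two": engine_two}
    for url in metasearch("search engines", engines):
        print(url)
```

A real metasearch engine would also have to parse each engine's result page and remove near-duplicate URLs, steps that are left out of this sketch.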
The column labeled # shows how many resources the engine submits its request to. In most cases this number is taken from the metasearch engine's homepage (as it appeared in July 2000). A question mark in this column means that I've not been able to find out, and a number accompanied by a question mark is my estimate.
| Site | # | Web | Usenet | Ftp | Prop. | Legal | Wire | Science | Reports | Ref. | Dir. | Access | Hosted | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AskJeeves | 1 | 5? | NH | 5 | |||||||||||
| Dogpile | 2 | ? | NH | 5 | |||||||||||
| I.SEE | 4 | NH | 5 | ||||||||||||
| IntelliSeek | 24 | NH | 5 | ||||||||||||
| MetaCrawler | 3 | ? | NH | 5 | |||||||||||
| SavvySearch | 700 | NH | 5 | ||||||||||||
Notes:
| Site | Web | Usenet | Ftp | Prop. | Legal | Wire | Science | Reports | Ref. | Dir. | Access | Hosted | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AltaVista | 1 | NH | 5 | |||||||||||
| MP3 Search | 2 | NH | 5 | |||||||||||
Notes:
The genre-oriented search engines are characterised by restricting their information space to a particular genre (such as scientific papers or newswire telegrams). Many of these provide extensive coverage of proprietary sources (often carefully maintained by human ontologists), while others are set up to locate genre-specific resources that are freely available on the World Wide Web. I've removed the Usenet and Ftp categories, since these don't apply in this context.
| Site | Web | Prop. | Legal | Wire | Science | Reports | Ref. | Dir. | Access | Hosted | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dialog | H | 5 | ||||||||||
| Froogle | H | 5 | ||||||||||
| Kompass | H | 5 | ||||||||||
| Lexis-Nexis | H | 5 | ||||||||||
Notes:
In addition, there once existed search engines for the now (mostly) defunct information spaces gopher and WAIS (Wide Area Information Servers). They are listed here with links to the last working web gateways. Don't be surprised if they've stopped working by the time you read this.
1) Some researchers, citing Claude Shannon's seminal 1949 paper on "The mathematical theory of communication", like to point out that there is a difference between data and information. Yes, sometimes there is, but in the present context I would argue that the two terms are interchangeable.