Outline & Key Points
- Size of the Web: Estimates the publicly indexable Web at 800 million pages, 6 terabytes of text data on 2.8 million servers as of February 1999.
- Method: The authors randomly sampled IP addresses, then excluded servers that were behind firewalls, required authorization, or contained no content. They then crawled 2,500 random servers, attempting to obtain all pages on each. The 800 million figure derives from the average number of pages on those 2,500 servers multiplied by their estimate of 2.8 million servers.
- Search Engine Rankings: As reported in the study, from largest to smallest with percentage of estimated 800 million pages: Northern Light 16%, AltaVista 15.5%, Snap 15.5%, HotBot 11.3%, MSN Search 8.5%, Infoseek 8%, Google 7.8%, Yahoo (their Inktomi database) 7.4%, Excite 5.6%, Lycos 2.5%, EuroSeek 2.2%.
- Based on: 1,050 searches from NEC Research Institute employees. Only documents that actually matched the search criteria were counted.
- Invalid Links: A table in the article gives the percentage of invalid links (defined as pages that no longer exist or have moved). The figure does not include pages that timed out, although those are often dead pages as well.
Northern Light 9.8%,
Yahoo (Inktomi) 2.9%,
MSN Search 2.6%,
- Age of documents: The study also evaluated new documents found by searches repeated daily. The date each new document was found was then compared with the date the document was last changed, which gives a sense of how long it takes a search engine to find new pages. The mean and the median scores were quite different, and they demonstrate that it takes several months for search engines to include new pages in their databases.
- Overlap: While the article states that "The overlap between the engines remains relatively low," it provides no further details or analysis of the overlap between the search engines.
- Kind of Information: According to the article, the authors classified the servers by type of information: scientific/educational, pornography, government, health, personal, community, religion, and societies. However, the article does not specify how this classification was done. Is all information on a university's .edu domain defined as scientific or educational? Many of the press reports looked primarily at this aspect, especially the percentage of porn covered.
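The arithmetic behind the size estimate and the coverage percentages can be sketched in a few lines of Python. This is only an illustration built from the figures quoted in the points above, not the authors' own code; the function names are hypothetical, and the per-server average is back-calculated from the 800 million and 2.8 million estimates rather than taken from the article.

```python
# Figures quoted in the summary above (February 1999 estimates).
ESTIMATED_SERVERS = 2_800_000
ESTIMATED_PAGES = 800_000_000

# Coverage of the estimated 800 million pages, as listed in the summary.
COVERAGE_PERCENT = {
    "Northern Light": 16.0,
    "AltaVista": 15.5,
    "Snap": 15.5,
    "HotBot": 11.3,
    "MSN Search": 8.5,
    "Infoseek": 8.0,
    "Google": 7.8,
    "Yahoo (Inktomi)": 7.4,
    "Excite": 5.6,
    "Lycos": 2.5,
    "EuroSeek": 2.2,
}


def estimate_total_pages(mean_pages_per_server, servers):
    """Size estimate: average pages per sampled server times server count."""
    return mean_pages_per_server * servers


def implied_pages_per_server(total_pages, servers):
    """Back-calculate the per-server average implied by the two estimates."""
    return total_pages / servers


def pages_indexed(coverage_percent, total_pages=ESTIMATED_PAGES):
    """Convert a coverage percentage into an approximate page count."""
    return round(total_pages * coverage_percent / 100)


if __name__ == "__main__":
    average = implied_pages_per_server(ESTIMATED_PAGES, ESTIMATED_SERVERS)
    print(f"Implied average: about {average:.0f} pages per server")
    for engine, percent in COVERAGE_PERCENT.items():
        print(f"{engine}: roughly {pages_indexed(percent):,} pages")
```

Dividing 800 million pages by 2.8 million servers implies an average of about 286 pages per server, and 16% coverage corresponds to roughly 128 million pages for the largest engine in the list.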
Search Engine Showdown Analysis
This study is based on a significantly larger and broader set of examples than the regular Search Engine Showdown analysis. Even so, its findings are remarkably similar and in general support my findings in terms of relative size, overlap, dead links, and change over time. In the Nature study, Snap ranked a bit higher than my usual findings.
The study goes well beyond what is covered on this site in the regular Search Engine Showdown analysis by estimating the size of the publicly indexable Web and classifying the type of information available. However, the article still leaves some unanswered questions about the study, its methodologies, and its limitations. What kind of queries were used? Were all of the variations in processing queries taken into account? Hopefully, Lawrence and Giles will post more details on their Web site than could fit into the article in Nature.
Other reports and views of the article