Search engines are as critical to Internet use as any other part of the network infrastructure, but they differ from other components in two important ways. First, their internal workings are secret, unlike, say, the workings of...
See more »
Search engines are as critical to Internet use as any other part of the network infrastructure, but they differ from other components in two important ways. First, their internal workings are secret, unlike, say, the workings of the DNS (domain name system). Second, they hold political and cultural power, as users increasingly rely on them to navigate online content. When so many rely on services whose internals are closely guarded, the possibilities for honest mistakes, let alone abuse, are worrisome. Further, keeping search-engine algorithms secret means that further advances in the area become less likely. Much relevant research is kept behind corporate walls, and useful methods remain largely unknown. To address these problems, we started the Nutch software project, an open source search engine free for anyone to download, modify, and run, either as an internal intranet search engine or as a public Web search service. As you may have just read in Anna Patterson's "Why Writing Your Own Search Engine Is Hard", writing a search engine is not easy. As such, our article focuses on Nutch's technical challenges, but of course we hope Nutch will offer improvements in both the technical and social spheres. By enabling more people to run search engines, and by making the code open, we hope search algorithms will become as transparent as their importance demands. TECHNICAL CHALLENGES Much of the challenge in designing a search engine is making it scale. Writing a Web crawler that can download a handful of pages is straightforward, but writing one that can regularly download the Web's nearly 5 billion pages is much harder. Further, a search engine must be able to process queries efficiently. Requirements vary widely with site popularity: a search engine may receive anywhere from less than one to hundreds of searches per second. Finally, unlike many software projects, search engines can have high ongoing costs. They may require lots of hardware that consumes lots of Internet bandwidth and electricity. We discuss deployment costs in more detail in the next section, but for now it's helpful to keep in mind a few ideas: * The cost of one part of the search engine scales with the size of the document collection. The collection might be very small when Nutch is searching a single intranet, but could be as large as the Web itself. * Another part of the search engine scales with the size of the query load. Each query takes a certain amount of time to process and consumes some bandwidth. * With these two factors in mind, we've designed a system that can easily distribute the work of both fetching and query processing over a set of standard machines. Figure 1 shows the system's components. WebDB. WebDB is a persistent custom database that tracks every known page and relevant link. It maintains a small set of facts about each, such as the last-crawled date. WebDB is meant to exist for a long time, across many months of operation. Since WebDB knows when each link was last fetched, it can easily generate a set of fetchlists. These lists contain every URL we're interested in downloading. WebDB splits the overall workload into several lists, one for each fetcher process. URLs are distributed almost randomly; all the links for a single domain are fetched by the same process, so it can obey politeness constraints. The fetchers consume the fetchlists and start downloading from the Internet. The fetchers are "polite," meaning they don't overload a single site with requests, and they observe the Robots Exclusion Protocol. (This allows Web-site owners to mark parts of the site as off-limits to automated clients such as our fetcher.) Otherwise, the fetcher blindly marches down the fetchlist, writing down the resulting downloaded text. Fetchers output WebDB updates and Web content. The updates tell WebDB about pages that have appeared or disappeared since the last fetch attempt. The Web content is used to generate the searchable index that users will actually query. Note that the WebDB-fetch cycle is designed to repeat forever, maintaining an
See less »
Kaboodle will send you a newsletter and updates from your friends. You can unsubscribe at any time. Kaboodle does not sell or share your email address or personal information with anyone.
Kaboodle requires all users to provide their real date of birth as both a safety precaution and as a means
of preserving the integrity of the site. You will be able to hide this information from your profile if you wish.
Added by 1 people