What is a Crawler?
Crawler: An automated function of some search engines that indexes a page and then visits the pages that the initial page links to. As this cycle repeats, search engine crawlers (also called “bots” or “spiders”) can index a massive number of pages very quickly.
A crawler is a program that visits Web sites and reads their pages and other information in order to produce entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a “spider” or a “bot.” Crawlers are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed. Crawlers apparently acquired the name because they crawl through a site a page at a time, following the links to other pages on the site until all pages have been read.
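The page-by-page traversal described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `SITE` dictionary is a hypothetical in-memory stand-in for a real website, and `crawl` and `get_links` are names chosen for this example, not part of any search engine's actual crawler.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the URLs it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": ["/blog"],
}

def crawl(start, get_links):
    """Visit pages breadth-first, following links until no new page is found."""
    seen = {start}          # URLs already queued, so no page is read twice
    queue = deque([start])  # pages waiting to be read
    order = []              # pages in the order they were read
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

if __name__ == "__main__":
    print(crawl("/", lambda url: SITE.get(url, [])))
```

A production crawler would fetch pages over the network and respect rules such as robots.txt; the sketch only shows the link-following loop.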
Web crawlers collect information such as the URL of the website, the meta tag information, the Web page content, the links in the webpage and the destinations of those links, the web page title, and any other relevant information. They keep track of the URLs that have already been downloaded to avoid downloading the same page again. A combination of policies, such as the re-visit policy, selection policy, parallelization policy, and politeness policy, determines the behavior of a Web crawler. Web crawlers face many challenges, namely the large and continuously evolving World Wide Web, content selection tradeoffs, social obligations, and dealing with adversaries.
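As a rough illustration of the kind of information listed above, the following Python sketch pulls the title, named meta tags, and link destinations out of a page using only the standard library. The class name `PageInfoParser`, the helper `extract_page_info`, and the sample HTML and URLs are assumptions made for this example; real crawlers typically use more robust parsing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageInfoParser(HTMLParser):
    """Collects the title, named meta tags, and absolute link URLs of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_page_info(html, base_url):
    """Return a dict with the page URL, title, meta tags, and outgoing links."""
    parser = PageInfoParser(base_url)
    parser.feed(html)
    return {
        "url": base_url,
        "title": parser.title.strip(),
        "meta": parser.meta,
        "links": parser.links,
    }

# Hypothetical sample page used only for demonstration.
SAMPLE_HTML = """<html><head><title>Example page</title>
<meta name="description" content="A sample page">
</head><body>
<a href="/about">About</a> <a href="https://example.org/x">External</a>
</body></html>"""

if __name__ == "__main__":
    print(extract_page_info(SAMPLE_HTML, "https://example.com/"))
```

The extracted links would then be checked against the set of already-downloaded URLs before being scheduled for a visit.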
Web crawlers are key components of Web search engines and of other systems that analyze web pages. They gather the pages from which the search index is built, so that users can send queries against the index and receive the webpages that match. Another use of Web crawlers is Web archiving, in which large sets of webpages are periodically collected and archived. Web crawlers are also used in data mining, where pages are analyzed for various properties, such as statistics, and data analytics are then performed on the results.
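To make the relationship between crawled pages, an index, and queries more concrete, here is a toy inverted index in Python. It is a sketch under heavy assumptions: the `PAGES` data is invented, the function names `build_index` and `search` are illustrative, and matching is plain whitespace-separated word lookup, which is far simpler than what real search engines do.

```python
from collections import defaultdict

# Hypothetical crawled content: URL -> page text.
PAGES = {
    "https://example.com/": "web crawlers index pages",
    "https://example.com/archive": "web archiving collects pages periodically",
    "https://example.com/mining": "data mining analyzes page statistics",
}

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

if __name__ == "__main__":
    index = build_index(PAGES)
    print(sorted(search(index, "web pages")))
```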