Imagine yourself in a library without a librarian. How would you find the book you have been looking for all this while? It would be practically impossible. Search engines, like librarians, index the websites on the World Wide Web, store a vast number of web pages in their databases and present a list of websites based on the word or words, known as “keywords”, typed in by the user. The major search engines today are Google, MSN Search, AltaVista, AOL Search and All The Web. These search engines make your search for the right site or content much easier and quicker.

The Crawling Process:

Search engines make use of a software program called a “spider” or “robot”, which “crawls” the web: it parses the text of thousands of web pages in order to produce the search results. Robots read through the text to assess the relevancy, reliability and importance of the content of web pages, then index the pages and store them in the search engine’s database. Search engine robots generally can’t read content held in graphics, frames, Flash or other similar technologies. The process by which search engines read and store web pages in their database is known as the “crawling” process. Crawlers move from one web page, mostly starting with the home page, to other pages through the links provided on the previously crawled page. Robots start their search with directories, which are websites holding a large number of links to other websites, sorted into categories by human editors. Hence, sites that have no inbound links and are not listed in any directory are not crawled by the robots.
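
The link-following behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the PAGES dictionary is a hypothetical in-memory stand-in for the web (a real robot would fetch pages over HTTP), and the names LinkExtractor and crawl are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real robot would fetch over HTTP.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    "/products": '<a href="/products/widget">Widget</a>',
    "/products/widget": "<p>A widget.</p>",
    "/orphan": "<p>No other page links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed):
    """Breadth-first crawl starting at seed; returns pages in visit order."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # link points to a page we cannot fetch
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


crawled = crawl("/")
# crawled == ["/", "/about", "/products", "/products/widget"]
```

Starting from the home page, the sketch reaches every page that is linked from an already crawled page. The "/orphan" page has no inbound links, so it is never visited.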

Indexing Algorithm:

The set of parameters on the basis of which web pages are indexed is called the “indexing algorithm”. Crawlers crawl the web pages using this algorithm. One such algorithm, PageRank™, is used by Google. These algorithms take into account criteria such as the title of the web page, the use of keywords in the title as well as the way they are used in the content, the relevancy of the content to the keyword typed in, how many sites link to that website, and many other such parameters when producing the results. Different search engines have different indexing algorithms and hence give different results for the same keyword.
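
To make the “how many sites link to that website” criterion concrete, here is a minimal sketch of the simplified, textbook form of PageRank. It is not Google’s production algorithm and covers only the link-based part of the criteria listed above. The four-page LINKS graph, the damping value of 0.85 and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to; every
    link target must itself be a key of the dictionary.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
LINKS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)  # the page with the highest rank
```

In this toy graph, page c collects links from three other pages and ends up with the highest rank, while b and d, which nobody links to, share the lowest.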

The spiders crawl a website periodically and note the changes made to any of its pages. When giving results, search engines also give importance to sites that are updated frequently. There’s an optional tag called the “Revisit tag” that can be used to suggest the time period after which the search engines should crawl your page again, provided you update your content within the specified time period. It may seriously affect your search engine rankings if the robot revisits your site based on the revisit tag but still finds the old content. Spiders update their index on finding an updated web page, though the index update may not occur as soon as the updated web page is crawled.
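
One simple way a spider could notice that a page has changed between visits is to compare a fingerprint of its content with the one stored on the previous visit. The sketch below assumes this hashing approach purely for illustration; the text does not say how real search engines detect changes, and the names fingerprint, needs_reindex and the stored dictionary are hypothetical.

```python
import hashlib


def fingerprint(html):
    """Returns a SHA-256 digest of the page content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def needs_reindex(url, html, stored):
    """Reports whether the page is new or changed, and records its fingerprint.

    stored is a dictionary mapping URL -> fingerprint from the last visit.
    """
    new_print = fingerprint(html)
    changed = stored.get(url) != new_print
    stored[url] = new_print
    return changed


stored = {}
first_visit = needs_reindex("/news", "<p>Old story</p>", stored)   # page is new
second_visit = needs_reindex("/news", "<p>Old story</p>", stored)  # unchanged
third_visit = needs_reindex("/news", "<p>New story</p>", stored)   # updated
```

Only the first and third visits report a change, so only those would lead to an index update.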

Managing the spider:

Information about which of your pages should and shouldn’t be indexed by the robot can be provided in the Robots Meta tag, a special Meta tag placed within the head section of a page. The same tag can also specify whether the robot should or shouldn’t follow the links on that page.
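
As an illustration, the sketch below shows a page carrying a Robots Meta tag and how a robot might read it. The sample page and the class name RobotsMetaParser are made up for this example; “noindex” and “nofollow” are the commonly used directive values for keeping a page out of the index and for not following its links.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Hypothetical page that asks robots not to index it or follow its links.
PAGE = """<html><head>
<title>Members area</title>
<meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""

parser = RobotsMetaParser()
parser.feed(PAGE)

may_index = "noindex" not in parser.directives    # False for this page
may_follow = "nofollow" not in parser.directives  # False for this page
```

A page without the tag yields an empty set of directives, which this sketch treats as permission to both index the page and follow its links.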

Through a file named “robots.txt”, which can be written in any text editor, we can control which portions of the website should and shouldn’t be crawled by robots. Since a robot visiting a site reads this file first of all, you can also use it to specify which robots may crawl your website, or to prevent a specific search engine from crawling your site. Thus this file works like a firewall between your website and the different search engine robots, at least for those robots that choose to honor it.
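
The sketch below shows what such a file might contain and how a well-behaved robot could consult it, using the robots.txt parser from Python’s standard library. The robot names “BadBot” and “GoodBot”, the “/private/” directory and the example.com URLs are placeholders invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one robot entirely and keep
# every other robot away from the /private/ directory.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# A robot asks for permission before fetching each URL.
home_ok = rules.can_fetch("GoodBot", "http://www.example.com/index.html")
private_ok = rules.can_fetch("GoodBot", "http://www.example.com/private/data.html")
badbot_ok = rules.can_fetch("BadBot", "http://www.example.com/index.html")
```

Here home_ok is True, while private_ok and badbot_ok are both False: the general rule blocks only the private directory, and the rule addressed to “BadBot” blocks that robot from the whole site.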

Thus search engines basically use different programs to find, index and list the websites available on the World Wide Web.

