4th Media 

How Search Engines Work

Bruce Zhang
2004-02-26

Components of a Search Engine

The search pages of the three major search engines (Google, Yahoo Search and MSN Search) are simple, clean and minimal. Behind the scenes, however, a lot is going on within a search engine company. It is busy doing three things: collecting documents from the Internet (crawling), analyzing the collected documents (indexing), and serving Web users' search requests. Each of these major tasks is performed by a corresponding software component of the search engine.

  • Crawler or Spider: a bot that collects documents on the Web.
  • Indexer: a software application that analyzes documents and generates searchable indexes for them.
  • Query Server: a software system that responds to user queries and returns relevant documents.

Crawler

Writing a simple crawler is not particularly challenging. To crawl billions of pages effectively, however, a crawler has to make two difficult decisions:

  • What Pages to Crawl: Each search engine uses different criteria to determine which pages to crawl. Google will not include a page unless at least one already-indexed page links to it.
  • Frequency of Updating: Google updates pages with higher PageRank values more frequently and, for most sites, crawls the homepage daily to check for fresh content. Google also scans the entire Web monthly, which means that every linked page has a chance to be updated each month. MSNbot has been more aggressive than Googlebot since MSN launched its new search engine. Yahoo's bot (Slurp) has been relatively slow to index new content.
In order to compete with Google, Yahoo and MSN have to aggressively expand the size of their index databases.
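The link-following rule described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory "Web" (the `TOY_WEB` dictionary and the `toy_fetch_links` helper are invented for illustration) instead of real HTTP requests, and it is not how any particular search engine implements its crawler. It only shows that a breadth-first crawl starting from known pages never discovers a page that nothing links to.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["b.example/about"],
    "b.example/about": [],
    "orphan.example/": ["a.example/"],  # links out, but nothing links to it
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl: return URLs in the order they were collected."""
    seen = set(seeds)          # URLs already queued, to avoid revisiting
    frontier = deque(seeds)    # URLs waiting to be fetched
    collected = []
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected


if __name__ == "__main__":
    print(crawl(["a.example/"], toy_fetch_links))
```

Starting from `a.example/`, the crawl reaches every page that is linked from an already collected page, while `orphan.example/` is never found. A production crawler would also need politeness rules, revisit scheduling and error handling, none of which is shown here.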

Indexer

This is the heart of a search engine. The indexer performs two major tasks. First, it generates a list of page characteristics that summarizes a document. Second, it produces a weighted, searchable keyword list from those characteristics.

There are many startups in the search engine industry besides the titans Google, Yahoo and MSN. Taking search technology to the next level will require a conceptual breakthrough in how documents are indexed, and a new way of characterizing documents may eventually take a startup from unknown to major player. Google's success in the search industry is clearly the result of leveraging the PageRank concept. In the 2005 Superbowl update, Google may have placed more weight on site-wide criteria, such as SiteRank.

Google characterizes a document by over one hundred factors. Not all of them are currently used for ranking search results, and new factors are added continuously. Every major update may involve adding new ranking factors and rebuilding the index database. The completeness of the ranking factors, however, holds the key to flexibly tuning the ranking algorithm in response to market changes. The characteristics of a document fall into four groups:

  • Page Content,
  • Back Links,
  • Page Traffic and
  • User Feedback.
The details of the ranking factors are discussed in The Factors that Impact the Search Engine Result Ranking.
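The indexer's two tasks (characterize a document, then derive a weighted keyword list) can be sketched as follows. This is a toy illustration covering only the Page Content group: the `FIELD_WEIGHTS` values, the document fields and all function names are assumptions made for the example, not the factors or weights any real search engine uses.

```python
import re
from collections import Counter, defaultdict

# Assumed field weights: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def characterize(doc):
    """Task 1: summarize a document as per-field term counts."""
    return {field: Counter(tokenize(doc.get(field, ""))) for field in FIELD_WEIGHTS}


def weighted_keywords(characteristics):
    """Task 2: collapse the characteristics into one weighted keyword list."""
    weights = defaultdict(float)
    for field, counts in characteristics.items():
        for term, count in counts.items():
            weights[term] += FIELD_WEIGHTS[field] * count
    return dict(weights)


def build_index(docs):
    """Build an inverted index: term -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        for term, weight in weighted_keywords(characterize(doc)).items():
            index[term][doc_id] = weight
    return dict(index)


if __name__ == "__main__":
    docs = {
        "d1": {"title": "Search engines", "body": "How search engines crawl the Web"},
        "d2": {"title": "Crawler basics", "body": "A crawler collects documents"},
    }
    print(build_index(docs)["search"])
```

Keeping the characteristics (task 1) separate from the weighting (task 2) mirrors the point made above: if the stored characteristics are complete, the weights can be retuned without collecting the documents again.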

Query Server

The Query Server, or search engine software, serves users' search requests. This is the part of the search engine that is visible to Web users.

  • First, it tries to understand and interpret users' search terms. This is likely to be the competitive front on which search engines improve the quality of search results in the near future. Unless search terms are properly interpreted, search engines can't find the most relevant documents. How search terms are interpreted will also partly determine how documents are indexed.
  • Second, the Query Server tries to retrieve and rank the relevant documents.
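The two steps above can be sketched in a few lines. The example assumes a hypothetical pre-built index (`TOY_INDEX`), an invented stop-word list and a deliberately crude plural-stripping rule. Real query interpretation is far more involved, so this is an illustration of the flow rather than a description of any search engine's software.

```python
from collections import defaultdict

# Hypothetical pre-built index: term -> {doc_id: weight}.
TOY_INDEX = {
    "search": {"d1": 4.0, "d3": 1.0},
    "engine": {"d1": 2.0, "d2": 1.0},
    "crawler": {"d2": 4.0},
}

# Assumed stop words dropped during query interpretation.
STOP_WORDS = {"a", "an", "the", "of", "how", "do", "is"}


def interpret(query):
    """Step 1: turn the raw search string into normalized search terms."""
    terms = []
    for word in query.lower().split():
        word = word.strip(".,?!\"'")
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # crude plural stripping, for illustration only
        if word and word not in STOP_WORDS:
            terms.append(word)
    return terms


def search(query, index, limit=10):
    """Step 2: retrieve matching documents and rank them by summed weight."""
    scores = defaultdict(float)
    for term in interpret(query):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    print(search("How do search engines work?", TOY_INDEX))
```

Note that the query term "engines" only matches because the interpretation step reduces it to "engine", the form stored in the index. This is one small instance of how query interpretation and indexing depend on each other.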
The fundamental reason that search engines sometimes fail to return relevant documents is a lack of understanding of search relevance. Search quality is about the quality and relevance of the pages retrieved. Google took the search experience to a new level by using the simple yet elegant PageRank algorithm to measure the importance of a Web page. Observations suggest that the Superbowl update improved relevance and quality for more general search terms but decreased them for more specific search terms. The quality and relevance gaps between Google and other search engines are getting narrower and narrower.
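For readers unfamiliar with PageRank, the idea of measuring a page's importance from its incoming links can be sketched with a simplified power iteration. This follows the commonly published textbook formulation. The damping value of 0.85, the iteration count and the toy link graph are assumptions for the example, and it should not be read as a description of Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

In the toy graph, page `c` is linked from three pages and ends up with the highest score, and page `a` inherits importance from `c` even though only one page links to it.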



8 Inverness Drive East, Suite 130, Englewood, Colorado, USA
Copyright © 2005-2011 4th Media. All rights reserved.