
A Comparative Study on the Performance of World Wide Web Search Engines

Internet has emerged as the most powerful medium for storage and retrieval of information.

Randy Trevor
Oct 20, 2009


 

WORLD WIDE WEB

 

The World Wide Web, abbreviated as WWW or referred to simply as the Web, is a system of Internet servers that supports hypertext to access several Internet protocols on a single interface. Because of this feature, and because of the Web's ability to work with multimedia and advanced programming languages, the Web is the fastest-growing component of the Internet. It is thus a collection of millions of files stored on thousands of computers (called servers) all over the world.

 

The World Wide Web has revolutionized the way people access information and has opened up new possibilities in areas such as digital libraries, information dissemination and retrieval, education, commerce, entertainment, government and health care. “The amount of publicly available information on the web is increasing consistently at an unbelievable rate”. “It is a gigantic digital library, a searchable 15 billion-word encyclopedia”. It has stimulated research and development in information retrieval and dissemination.

The Internet

 

In recent years, the Internet has emerged as the most powerful medium for the storage and retrieval of information. It works round the clock and connects every nook and corner of the globe. “With an unprecedented growth in the quantum of knowledge worldwide and the easy accessibility, Internet has become an unavoidable necessity for every institution of higher learning and research”. It consists of millions of computers interconnected through the worldwide telecommunication system. All of these computers are able to share information with each other because they use common communication protocols. Internet protocols allow many different network technologies, from local area networks to wide area networks, to be interconnected for information communication and its applications. It supports audio and video clips as well as text and images. Thus the Internet is at once a worldwide broadcasting capability, a mechanism for information dissemination, and a medium for collaboration and interaction between individuals and their computers without regard for geographic location.

 

WWW SEARCH ENGINES

One of the key aspects of the World Wide Web that makes it a valuable information resource is that the full text of documents can be searched using web search engines. This is one of the many available ways to obtain information from the Internet. Search engines are mechanisms that aid users in searching the entire Internet for relevant information. They use automatic processes to update, modify and maintain references to web sites and web pages. They index the information available on the net, categorize it under various heads and then present it for searching by users.

They provide keyword searching capability, based on the indexing of text contained within a document, and deliver a list of WWW links (URLs) that contain the keyword entered in the search statement. According to Alan Poulter, “A www search engine is defined as retrieval service, consisting of a database(s) describing mainly resources available on the www, search software and a user interface also available via www”.

 

All crawler-based web search engines consist mainly of three major parts. The first is the spider, also referred to as a robot, crawler or worm. The task of spiders is to crawl the web, roaming the Internet periodically; their goal is to find content and information to add to the search engine’s database. Everything the spider finds goes into the second part of the search engine, the index, sometimes also referred to as the catalogue. It is a huge repository, like a giant book, containing a copy of every web page that the spiders find. The index is updated whenever a web page changes. The search engine software is the third part. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in the order it believes is most relevant.
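To make this three-part architecture concrete, the toy sketch below (in Python, with entirely hypothetical names; it is not the code of any real search engine) shows a spider routine filling an index and a search routine ranking pages against a query.

```python
from collections import defaultdict

class ToySearchEngine:
    def __init__(self):
        # The index (catalogue): word -> set of URLs of pages containing it
        self.index = defaultdict(set)
        self.pages = {}  # URL -> stored copy of the page text

    def spider(self, url, text):
        """Plays the role of the spider: add one crawled page to the index."""
        self.pages[url] = text
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, query):
        """The search software: rank pages by how many query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in self.index.get(word, ()):
                scores[url] += 1
        # Highest score first -- a crude stand-in for a relevance-ranking algorithm
        return sorted(scores, key=scores.get, reverse=True)

engine = ToySearchEngine()
engine.spider("http://example.org/a", "information retrieval on the web")
engine.spider("http://example.org/b", "web search engines index the web")
print(engine.search("web search"))  # ['http://example.org/b', 'http://example.org/a']
```

Real engines differ mainly in how each of these parts is tuned, which is the point made in the next paragraph.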

 

 

All crawler-based search engines have the basic parts described above, but there are differences in how these parts are tuned. That is why the same search on different search engines often produces different results.

 

 

NEED AND OBJECTIVES OF THE STUDY

 

The WWW, although a latecomer in the Internet family, has rapidly gained popularity and become the second most widely used application of the Internet. The publicity the WWW has gained is so great that many people naively equate the WWW with the Internet. The freewheeling nature of publishing on the Web is a blessing for the flow of ideas. As a result, the web has become a sea of all kinds of data, making any query into this huge information reservoir extremely difficult. Indeed, “finding information on the Web is a matter of luck; often it is like one puts a hand in the heap of garbage to find a gold coin”.

 

 

Search engines are among the most popular tools for resource discovery on the WWW. All search engines follow different algorithms to index information on the web and to output results to a user's query. In order to be effective on the web, it is important to use the search engine most suited to one’s subject domain.

 

 

With the advent of huge search engines like Google and AltaVista, indexing many millions of home pages, measuring the number of hits is no longer an effective measure. The question of the quality of the hits, rather than their quantity, is becoming more important.

 

The freewheeling nature of publishing on the Web is a blessing for the flow of ideas, but it has also complicated the process of retrieving relevant information. In contrast to traditional IR, there are no consistent indexing and classification principles for organizing materials on the Web. Nor are there any filtering practices at hand to ensure the quality and credibility of the documents.

 

In order to overcome these difficulties in retrieving information from the WWW, a plethora of search engines has become available in recent years. However, since there are usually only one or two search aids for other Internet applications (e.g., Archie for FTP and Veronica for Gopher), why have so many search engines been developed for the Web? The sheer number invites research. For instance, what features do the various Web search engines offer? How do they differ from one another in performance? Is there a single Web search engine that outperforms all others in information retrieval? The current study attempts to answer these questions. Its main objective is to evaluate the performance of selected search engines in locating information on the WWW in terms of the following:

 

 

(a) Which search engine provides the most exhaustive search of the WWW?

(b) Which search engine provides more relevant search results compared with the other selected search engines?

(c) Which of the selected search engines is discipline-supported, i.e., strong for searches in a specific subject?

(d) Which selected search engine provides more current results?

(e) Which selected search engine provides more non-workable links?

 

 

The reported findings of the earlier studies obviously do not agree with one another. The methodologies and evaluation criteria used in those studies also differed. Can a feasible methodology be developed to help Web users select, from the great number of choices, the search engine most appropriate to their specific search needs? This study attempts to do so by evaluating the searching capabilities and performance of selected Web search engines.

 

 

REVIEW OF RELATED LITERATURE

 

The review of literature shows that, in the realm of search engine studies, many studies comparing relevance have been conducted. However, this field was chosen for the present study because many previous studies have arrived at conflicting conclusions as to which services are better at delivering superior precision, and because most of those studies have either had small test suites or have not reported how the study was conducted.

 

Most published precision studies have had test suites that were too small for statistical usefulness. Leighton's study (Leighton, 1995) had only eight queries, and there have been a host of minor studies that purport to judge precision based on three, two or even a single query. Westera (1996) used only five queries, all dealing with wine. Ding and Marchionini (1996), the best modelled study to date, studied first-twenty precision, but used only five queries.

Gauch and Wang (1996) had twelve queries, studied almost all of the major search services (and even the major metasearch services) and reported first-twenty precision, but did not test for significance in the differences reported.

Chu and Rosenthal (1996) tested first-ten precision, had enough queries for statistical comparisons, recorded crucial information about how the searching was conducted, and performed some statistical tests. However, they studied only three search services, and they did not subject their mean precision figures to any test for significance. Tomaiuolo and Packer (1996) studied first-ten precision on two hundred queries. They did list the query topics that they searched, but they used structured search expressions (using operators) and did not list the exact expression entered for each service. They reported the mean precision, but again did not test for significance in the differences. They often did not visit a link to see whether it was in fact active, nor did they define their criteria for relevance. Despite these shortcomings, it is an impressive, huge study.

Studies reported in popular journals were often vague about how many or exactly what queries were searched. Scoville (1996) used first-ten precision and gave exact mean scores, but explained neither how many queries were used nor whether the differences in means were significant. Munro and Lidsky (1996) also used first-ten precision in a hefty fifty-query, ten-search-engine study, but did not list the queries or the statistical analysis. From their description, it is clear that their range of query types was much wider than that used in this study. They reported their results on a scale of four stars, indicating that more exact numbers would be easily misleading (probably because of issues with statistical significance). Venditto (1996) used first-twenty-five precision, but reported neither how many queries were used nor what the exact statistics were.

 

None of the studies in the related literature indicates any attempt to blind the evaluators to the service of origin of the links evaluated. Unless this step is taken, there must always be the question of bias, conscious or unconscious, on the part of the evaluator.

 

 

This work attempts to compare the precision of selected search engines, namely Google, AltaVista, Teoma and Alltheweb, in an objective and fair manner with regard to general subject queries that may be posed in an academic setting.

 

Methodology

 

Analysis of the related literature shows that the various evaluative studies of search engines have employed different methods and procedures, attracting both appraisal and criticism. In the present study, every effort has been made to employ a well-understood and well-planned methodology for evaluating the performance of the search engines under study. It is hoped that the biases and lacunae associated with several previous studies will remain far from the present study. The methodology employed is described below under three distinct headings:

 

 

1. Test suite development,

2. Search method, and

3. Evaluation method

The Test Suite Development

Test suite development involves two steps: step one is the selection of the topics/queries to be searched for, and step two is deciding exactly what search expression to submit. As regards the selection of queries, these were general subject inquiries in an academic setting: queries actually asked at the VSAT Facility, Central Library, Dr. RML Avadh University, Faizabad. During December 2005, the researcher recorded the topic of every reference question in which the users specifically requested that the Internet be used as a source of information. These queries were neither invented nor selected by someone who knew the abilities of the various services. The first five queries received in each discipline were selected.

 

The selection of exactly what search expression to enter is perhaps the single weakest point in the design of this study. However, the selected queries are usually narrowly defined academic topics consisting of multiple words, and so they were used as search expressions without any modification.

 

As far as the search expression is concerned, while conducting preliminary searches it proved difficult to know, for each of the selected services and for each query, exactly what expression would be optimal. Furthermore, as Magellan's Voyeur (Magellan, 1997) indicates, most users do not use operators in their searching. Finally, unstructured queries force the search service to do more of the work, ranking results by its own algorithm rather than by the constraints specified by operators. Because of all these factors, natural-language phrasing was chosen as the preferred expression, and topics were modified only when they were too easily open to multiple interpretations. Whether these choices were optimal, or even adequate, is an issue open to criticism.

Search Method

 

Close time proximity of the searches is essential for objectivity in evaluative studies of search engines: a query should be searched on the different services as close together in time as possible. Ideally, a query should be executed on all search services at the same time, because if a relevant page were to become available between the time one engine was searched and the time a second was searched, the second would have the unfair advantage of being able to index and retrieve the new page. In the present study, all of the search engines were searched, for a given query, at the same time. For a given query, all selected search engines were opened in different windows, the search expression was given to each engine, and then, within seconds, the search was executed on all the search engines. Only in cases where a particular engine could not execute the search was a second or third attempt made, and even in such cases the time gap was not more than two minutes.
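As an illustration of this near-simultaneous submission, the hedged sketch below issues one query to several engines within seconds of each other using threads. It is not the procedure actually used in the study (which used browser windows), and the result-page URL patterns are placeholders assumed purely for illustration.

```python
import concurrent.futures
import requests

# Hypothetical result-page URL patterns; assumptions for illustration only.
ENGINE_URLS = {
    "google":    "https://www.google.com/search?q={q}",
    "altavista": "https://www.altavista.com/web/results?q={q}",
    "teoma":     "https://www.teoma.com/search?q={q}",
    "alltheweb": "https://www.alltheweb.com/search?q={q}",
}

def fetch_results(engine, query):
    """Fetch the result page for one engine; returns (engine, HTML or error text)."""
    url = ENGINE_URLS[engine].format(q=requests.utils.quote(query))
    try:
        resp = requests.get(url, timeout=10)
        return engine, resp.text
    except requests.RequestException as exc:
        return engine, f"ERROR: {exc}"

def search_all(query):
    # Threads let the four requests go out within seconds of one another,
    # minimising the time gap the text above warns about.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch_results, engine, query) for engine in ENGINE_URLS]
        return dict(future.result() for future in futures)

results = search_all("digital library evaluation")
```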

 

Another way to achieve close time proximity is to check the pages cited in the results from the search services as quickly as possible after the results are obtained. The longer one waits, the more likely it is that some pages which were truly active at the time of the search have since been removed from the Web and are erroneously judged inactive by the researcher. In the present study, the Netscape web browser was used to obtain all the pages. All results of each query were retrieved at the same time and stored in labelled files. A corresponding record was also created in an MS Word file, where the web links and titles of the retrieved documents were copied and stored. This was done on the same day, and at the same time, as the searches were conducted. The subject experts then evaluated the stored pages over a period of one day to a week.
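A minimal sketch of this snapshot step is given below; the file layout and the `results` structure are hypothetical, but the idea is the same: each retrieved result page is written to a labelled file, and a parallel record of engine, rank, title and link is kept so that later evaluation is unaffected by pages disappearing from the Web.

```python
import csv
import pathlib

def snapshot_results(query_id, results):
    """results: list of (engine, rank, title, url, html) tuples for one query."""
    out_dir = pathlib.Path(f"query_{query_id}")
    out_dir.mkdir(exist_ok=True)
    with open(out_dir / "record.csv", "w", newline="", encoding="utf-8") as record:
        writer = csv.writer(record)
        writer.writerow(["engine", "rank", "title", "url"])
        for engine, rank, title, url, html in results:
            # Store a labelled copy of the page itself.
            (out_dir / f"{engine}_{rank:02d}.html").write_text(html, encoding="utf-8")
            # Keep the corresponding record of link and title.
            writer.writerow([engine, rank, title, url])
```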

Evaluation Method

 

The concerned subject experts, all faculty members at Dr. RML Avadh University, Faizabad, were invited to evaluate the contents of all retrieved and stored pages. Initially, various pretests were conducted to identify possible biases. In these pretests it was noticed that evaluators were developing biases and judgments about the various search services. To prevent these biases from clouding the judgment of individual pages returned by the services, a method of blinding the pages was developed so that, for any given page, no one could know ahead of time which service had returned it as a result.

 

 

The evaluators were asked to call up each stored and labelled page, inspect it and assign it to one of these categories: Relevant, Partially Relevant, Distant Relevant, Irrelevant, or Page Not Found. Certain unique features (such as the title and web link address) were noted so that a match could be established at a later stage of evaluation for identical result pages. In this way, even if the evaluation was not evenly or fairly done in other respects, at least the same page would receive the same score throughout the evaluation of a query.
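This consistency device can be sketched as follows; the function names and the blinded-label/URL structure are hypothetical, but the logic matches the description above: once a page (matched on its link) has been assigned a category, any later occurrence of the same page automatically receives the same category.

```python
CATEGORIES = ["Relevant", "Partially Relevant", "Distant Relevant",
              "Irrelevant", "Page Not Found"]

def evaluate(pages, ask_evaluator):
    """pages: list of (blinded_label, url); ask_evaluator: callable returning a category."""
    seen = {}        # url -> category already assigned during this query's evaluation
    judgements = {}  # blinded_label -> category
    for label, url in pages:
        if url in seen:
            # Same page encountered again: reuse the earlier score.
            judgements[label] = seen[url]
            continue
        category = ask_evaluator(label)
        assert category in CATEGORIES
        seen[url] = category
        judgements[label] = category
    return judgements
```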

 

 

Both in the way the queries were chosen and in the way the resulting pages were evaluated, all attempts were made to prevent natural personal biases from affecting the study.

 

 

FINDINGS AND RECOMMENDATIONS

 

In this study, a relevancy comparison of four World Wide Web search services, commonly called “search engines”, has been made. The selected search engines are Google, Altavista, Teoma and Alltheweb. A set of one dozen search statements, drawn from three subject fields, was used. Relevancy was measured on the basis of the first ten search results provided by each search engine.

 

 

For ease of analysis and interpretation, the collected data have first been divided into two categories, viz. the First Five Output (FFO) and the Last Five Output (LFO), and have been measured in terms of Relevant (R), Partially Relevant (PR), Distant Relevant (DR) and Irrelevant (IR). In all cases where a web page could not be opened, for whatever reason, it has been categorized as Page Not Found (PNF). In this way all search output has been calculated and analyzed, and the results drawn.
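A worked sketch of this FFO/LFO split, assuming ten judged results per query, is given below; the example judgements are invented purely for illustration.

```python
from collections import Counter

def tally(judgements):
    """judgements: list of 10 category strings, in rank order, for one query."""
    ffo, lfo = judgements[:5], judgements[5:]

    def shares(block):
        counts = Counter(block)
        return {cat: 100 * counts[cat] / len(block) for cat in counts}

    return {"FFO": shares(ffo), "LFO": shares(lfo)}

example = ["Relevant", "Relevant", "Partially Relevant", "Irrelevant", "Relevant",
           "Distant Relevant", "Irrelevant", "Relevant", "Page Not Found", "Irrelevant"]
print(tally(example))
# FFO: 60% Relevant, 20% Partially Relevant, 20% Irrelevant;
# LFO: 20% Relevant, 20% Distant Relevant, 40% Irrelevant, 20% Page Not Found
```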

 

 

With a view to measuring the currency of the collected data, all of the output provided has also been analyzed. While collecting data, the date of posting of each web page and its date of update were noted. The currency of a web page is based on its posting date or the date on which it was last updated: where a revision date is given, it determines the currency of the page; where it is not available, the posting date has been used. For ease of calculation and analysis, the currency period has been categorized into four groups, viz. 2006-2005, 2004-2003, 2002-2001 and 2000-earlier. However, while analyzing the collected data it was found that a large number of web pages have neither a date of posting nor a date of revision or update. All such cases have been categorized as Currency Not Available (CNA). This creates five categories of currency of web pages.
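A minimal sketch of this currency grouping, under the rules just described (update date preferred, posting date as a fallback, CNA when neither is available), might look as follows.

```python
def currency_group(posting_year=None, update_year=None):
    """Assign a web page to one of the five currency categories used in the study."""
    year = update_year if update_year is not None else posting_year
    if year is None:
        return "CNA"            # neither posting nor update date available
    if year >= 2005:
        return "2006-2005"
    if year >= 2003:
        return "2004-2003"
    if year >= 2001:
        return "2002-2001"
    return "2000-Earlier"

print(currency_group(posting_year=2002, update_year=2005))  # "2006-2005"
print(currency_group())                                     # "CNA"
```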

 

 

1.         It is important to note that all search engines claim to arrange their search results on the basis of relevancy to the search statement, and it is commonly accepted that search engines with a larger index provide a higher number of total results and that their relevancy is also higher in comparison to others. The present study has found these assumptions to be false: search engines with a smaller index have provided more current and relevant results than the search engines with a bigger index.

 

 

2.         In the Library and Information Science (LIS) subject field, Google has provided the maximum output of 45,177,500 results, followed by Altavista with an output of 2,346,500 results and Alltheweb with 2,148,000 results. Teoma has provided the minimum output, an average of 448,650 results. As far as relevancy is concerned, Google has provided 30%R, 27.5%PR, 22.5%DR, 17.5%IR and 2.5%PNF results. Altavista has provided 35%R, 12.5%PR, 20%DR and 32.5%IR results. Teoma has provided 32.5%R, 17.5%PR, 15%DR, 27.5%IR and 5%PNF results. Alltheweb has provided 42.5%R, 15%PR, 15%DR and 27.5%IR results. Thus it can be said that Alltheweb provides the maximum of 42.5% relevant results. A close analysis shows that for relevant web pages one should use Alltheweb, particularly in the field of LIS.

 

 

3.         All search engines except Google have provided more relevant results in the first five output, whereas Google has provided equal relevant results in the first five output and the last five output.

 

 

4.         In terms of average currency in LIS Teoma has provided the maximum of 30% current results posted/updated in the year 2005-2006, followed by Alltheweb with 27.5% and Altavista with 25%. Google has provided the minimum of 20% current results posted/updated in the year 2005-2006 on average. 

 

 

5.         Similarly, in terms of pages where date of posting/update is not available Google on an average has provided the maximum of 37.5% results, followed by Altavista with 35% results, Alltheweb with 30% results and Teoma with minimum of 27.5% results in LIS.

 

 

6.         In Environmental Science, Google has provided the maximum output of 1,720,500 results, followed by Altavista with a total output of 595,250 results and Alltheweb with a total output of 539,250 results. Teoma has provided the minimum total output of 89,150 results. In terms of relevancy, Google has provided 40%R, 25%PR, 17.5%DR, 15%IR and only 2.5%PNF results. Altavista has provided 42.5%R, 32.5%PR, 10%DR, 12.5%IR and 2.5%PNF results. Teoma has provided 37.5%R, 32.5%PR, 7.5%DR and 22.5%IR results. Alltheweb has provided 42.5%R, 32.5%PR, 10%DR, 12.5%IR and 2.5%PNF results. Thus it is clear that Alltheweb and Altavista have not only provided an equal 42.5%R results but have also provided equal results in all other categories of relevancy. A close analysis shows that to find relevant information in Environmental Science one should use either Altavista or Alltheweb.

 

 

7.         Except Teoma, the other three search engines have provided more relevant results in the first five outputs, whereas Teoma has provided more relevant results in the last five outputs.

 

8.         As far as the currency of web pages is concerned, in Environmental Science Google has provided the maximum of 32.5% output posted/updated in the year 2005-2006, followed by Teoma with 30% output, while both Altavista and Alltheweb have provided only 25% output.

 

9.         Similarly, in terms of pages where the date of posting/update is not available, Google has again provided the maximum of 35% output, followed by both Altavista and Alltheweb with 30% output and Teoma with a minimum of 25% output. Google, Altavista and Alltheweb have provided 17.5% output posted/updated in 2003-2004, whereas Teoma has provided only 12.5% output for the same period. However, Google, Altavista and Alltheweb have provided 7.5% output posted/updated in 2001-2002, whereas Teoma has provided 17.5% output for the same period.

 

10.       In the Social Sciences also, Google has provided the maximum output of 15,200,000 results, followed by Altavista with a total output of 7,896,250 results and Alltheweb with a total output of 6,760,500 results. Teoma has again provided the minimum total output of 1,232,600 results.

 

11.       In terms of relevancy, Google has provided 45%R, 25%PR, 15%DR and 15%IR results. Altavista has provided 30%R, 25%PR, 35%DR and 10%IR results. Teoma has provided only 15%R, 32.5%PR, 32.5%DR and 15%IR results. Alltheweb has provided 32.5%R, 22.5%PR, 30%DR and 15%IR results. Thus a close analysis shows that to find relevant and current information in the Social Sciences one should use either Google or Alltheweb.

 

12.       Google, Altavista and Alltheweb have not provided any www links which are not working. Teoma is the only search engine which has provided such non-working www links, at 5%.

 

13.       Except Google, the other three search engines have on average provided more relevant results in the last five outputs. Google has provided more relevant results in the first five outputs.

 

14.       As far as the currency of web pages is concerned, in the Social Sciences Teoma has provided the maximum 40% output posted/updated in the year 2005-2006, followed by Google with 37.5% output and both Altavista and Alltheweb with 25% output for the same period.

15.       In terms of web pages where the date of posting/update is not available, three search engines, namely Google, Altavista and Alltheweb, have provided an equal 32.5% output. Teoma has provided the minimum 20% output of such web pages.



