Search Engines

 

            Search engines provide the best navigational tool on the World Wide Web. In some ways, using a search engine is very simple: users type in keywords, and the search engine returns relevant sites. Because of the vast number of Web pages currently on the Internet, however, searches typically return an unusable amount of data (many searches return over 10,000 “hits”). By understanding how search engines work and learning a few simple search strategies, teachers and students can make more effective use of these tools, especially when they also apply evaluation criteria to judge the reliability of the information found.

How Search Engines Work

            Search engines do not search the live pages posted on the World Wide Web each time a user enters a query. There are literally millions of Web pages currently posted, and it would take even the fastest search engine a very long time to search through them all. Instead, a search engine typically collects a database of Web page information, and searches through that database when a user types in a keyword. Search engines collect the information for this database in one of two ways. First, most engines allow users to submit new URLs. Thus, when a user has finished creating a new Web site, they go to a search engine page and fill out a short form that puts the information for the new page into that search engine’s database. In general, each search engine uses a different database, so users who want their pages to appear in a variety of search engines must submit their Web page information to each one separately. A second way some search engines collect information for their database is by using a small program called a robot (or worm). This robot travels through the Web looking for new pages that fit certain criteria. When the robot finds a new page, it automatically sends the information from that page back to the database.
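The idea of searching a prebuilt database rather than the pages themselves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page records, the example URLs, and the function names build_index and lookup are hypothetical, and a real engine's database is far more elaborate.

```python
from collections import defaultdict

# Hypothetical page records; a real engine would gather these from
# submitted URLs or from a robot that follows links.
PAGES = {
    "http://example.edu/macbeth.html": "Macbeth is a tragedy by William Shakespeare",
    "http://example.com/agency.html": "MacBeth marketing agency services",
    "http://example.org/hamlet.html": "Hamlet is a tragedy by William Shakespeare",
}


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, keyword):
    """Answer a query from the prebuilt index, without rereading any page."""
    return sorted(index.get(keyword.lower(), set()))


INDEX = build_index(PAGES)
print(lookup(INDEX, "MacBeth"))
```

The index is built once, when pages are submitted or discovered; each later query is a single dictionary lookup, which is why the engine can answer quickly even when the collection is large.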

            The databases that search engines compile can contain different kinds of information about the Web pages. Some databases perform full text searches of the pages in their database (in other words, if the search term is found anywhere in the Web page, the page will be returned as a positive find, or “hit”). Other databases compile only information included in the “head” of the document. All Web pages are divided into two sections: the “head” and the “body.” The “body” of the Web page is what produces the actual page that users will see. The “head” of the page contains information like the title and author of the page, other technical information about the page, and keywords. The author of the Web page, not the search engine, supplies these keywords, so their accuracy and relevance are a function of the integrity and knowledge of the author.  If a Web page does not include keywords in its “head”, the search engine might automatically assign keywords for that document that may or may not be accurate.
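To make the “head” information concrete, the following sketch reads the title and the author-supplied keywords from a small sample page using Python's standard html.parser module. The sample page, the class name HeadInfoParser, and the helper read_head are hypothetical; the sketch assumes the keywords are stored in a meta tag named “keywords” as a comma-separated list, which is a common convention but not the only one.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the title and the author-supplied keywords from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Assumed convention: <meta name="keywords" content="a, b, c">
            if (attr.get("name") or "").lower() == "keywords":
                content = attr.get("content") or ""
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical sample page used only for illustration.
SAMPLE_PAGE = """<html>
<head>
<title>Macbeth Study Guide</title>
<meta name="keywords" content="Macbeth, Shakespeare, tragedy">
</head>
<body><p>Notes on the play.</p></body>
</html>"""


def read_head(html):
    """Return (title, keywords) as found in the page markup."""
    parser = HeadInfoParser()
    parser.feed(html)
    return parser.title, parser.keywords


print(read_head(SAMPLE_PAGE))
```

A page with no keywords tag yields an empty keyword list here, which is the situation in which an engine would have to assign keywords of its own.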

Most search engines use this information provided in the “head” of the Web page in their databases. When a user searches for a term, therefore, the engine will search through its database, first looking at page titles, and then looking at keyword descriptions of pages. When the search engine finds all the sites in its database that match the term entered by the user, it returns that information to the user, and usually organizes the results in order of relevance.

The exact method by which search engines determine relevance varies. Typically, a search engine will look through the document and count how many times the term appears, and where it appears. So, for example, if the term appears in the title of the document and the keywords of the page, and is repeated throughout the text of the “body” of the page, the site will probably be given a high relevance. If the term appears in the domain name of the site (www.term.com), the site will receive an even higher rating. Some engines also consider cross-linking; that is, sites that are commonly referenced by other pages will receive a higher relevance rating, based on the assumption that a commonly referenced page is valuable.
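A relevance calculation of this kind might be sketched as follows. The weights are invented purely for illustration (engines do not publish their formulas), and the function name relevance and the example URLs are hypothetical. The sketch counts whole-word matches only and ignores cross-linking.

```python
from urllib.parse import urlparse

# Illustrative weights only; these numbers are assumptions, not any
# engine's actual formula.
WEIGHTS = {"domain": 10, "title": 5, "keywords": 3, "body": 1}


def relevance(term, url, title, keywords, body):
    """Score a page by where the term appears and how often."""
    term = term.lower()
    score = 0
    host = urlparse(url).hostname or ""
    # A term in the domain name (www.term.com) earns the largest bonus.
    if term in host.lower().split("."):
        score += WEIGHTS["domain"]
    score += WEIGHTS["title"] * title.lower().split().count(term)
    score += WEIGHTS["keywords"] * [k.lower() for k in keywords].count(term)
    score += WEIGHTS["body"] * body.lower().split().count(term)
    return score


high = relevance(
    "macbeth",
    "http://www.macbeth.com/index.html",
    "Macbeth Resources",
    ["Macbeth", "Shakespeare"],
    "Macbeth is a play by Shakespeare and Macbeth is a tragedy",
)
low = relevance(
    "macbeth",
    "http://www.example.edu/plays.html",
    "Plays",
    [],
    "A list of plays including Macbeth",
)
print(high, low)
```

Sorting hits by such a score, highest first, produces the “order of relevance” that users see in a results list.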

Choosing a Search Engine

            Each search engine uses a different database of information, so search results can differ greatly depending on the engine chosen. No one search engine is necessarily better than any other, and users will eventually have to choose through trial and error. Search engines do, however, work in different ways. Yahoo and Snap, for example, work as directory structures. These engines have organized their links into a series of nested categories, and when users search for a given term, the engine channels the search into the categories under which that term has been filed. For example, if a user searches for “MacBeth,” Yahoo will take the user to the following nested folders:

Arts: Humanities: Literature: Genres: Drama: Playwrights: Shakespeare, William (1564-1616): Works: Macbeth

A similar search on Snap results in a listing of 6 or 7 categories, including literature, entertainment, education, and dogs. Each of these search engines then lists uncategorized hits, according to the order of relevance discussed above. Other search engines, such as Alta Vista, Excite, and WebCrawler, do not pre-categorize their databases, so hits are based only on the order of relevance. A search for MacBeth on Excite yielded about 10,000 hits, while the same search on Alta Vista yielded just over 5,000, and each “rated” their results differently.

 

Refining a Search

            Search engines also allow users to refine their searches in a number of ways. The most common search refinement technique is the use of BOOLEAN OPERATORS, such as AND, OR, and NOT. These operators refine searches by linking terms together or excluding certain terms from the search. So, for example, if we were searching for MacBeth, and kept seeing a MacBeth marketing agency returned in our hits, we could add a NOT operator and exclude the term “agency.” This would exclude any hits that included the term “agency” from the list of hits. Conversely, we could add the operator AND to include the term “Shakespeare” and we would receive only hits that included both the terms MacBeth and Shakespeare. Most search engines include a “refine search” link, where these operators and options are explained in greater detail. Another aspect of refining search techniques concerns the use of quotation marks. Normally, searches are not case-sensitive (the engine doesn’t read capitalization). If a user uses quotation marks, however, the engine will look for the term(s) exactly as specified in the quotation marks. Thus, if we searched for “MacBeth plaY” including the quotation marks, the engine would return only those pages that included the entire phrase “MacBeth plaY” spelled as such. We would not likely find any results.
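The effect of the Boolean operators can be illustrated with a short Python sketch. The page texts, URLs, and the function name boolean_search are hypothetical, and the parameter names must, must_not, and any_of simply stand in for AND, NOT, and OR; actual engines each have their own query syntax.

```python
# Hypothetical page texts used only for illustration.
PAGES = {
    "http://example.edu/play.html": "Macbeth a tragedy by Shakespeare",
    "http://example.com/agency.html": "MacBeth marketing agency",
    "http://example.org/castle.html": "Macbeth castle tours",
}


def words(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def boolean_search(pages, must=(), must_not=(), any_of=()):
    """Return URLs containing every `must` term (AND), none of the
    `must_not` terms (NOT) and, if `any_of` is given, at least one of
    those terms (OR)."""
    hits = []
    for url, text in pages.items():
        found = words(text)
        if not all(t.lower() in found for t in must):
            continue
        if any(t.lower() in found for t in must_not):
            continue
        if any_of and not any(t.lower() in found for t in any_of):
            continue
        hits.append(url)
    return sorted(hits)


# MacBeth NOT agency
print(boolean_search(PAGES, must=["MacBeth"], must_not=["agency"]))
# MacBeth AND Shakespeare
print(boolean_search(PAGES, must=["MacBeth", "Shakespeare"]))
```

The first query drops the marketing agency page; the second keeps only the page that mentions both terms.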

            One final detail about search engine operations: search engines ignore some very common words, such as articles, conjunctions, and versions of the verb “be.”  If a user searched for “MacBeth” “and” “play,” the engine would ignore the conjunction, so would return the same hits as the user would have received for “MacBeth” “play.”
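Dropping common words from a query is simple to sketch. The word list below is a small illustrative sample, not any engine's actual list, and the function name strip_stop_words is hypothetical.

```python
# A small illustrative sample of ignored words: articles, conjunctions,
# and forms of the verb "be". Real engines use their own lists.
STOP_WORDS = {
    "a", "an", "the",
    "and", "or", "but",
    "be", "am", "is", "are", "was", "were", "been",
}


def strip_stop_words(query_terms):
    """Return the query terms with the ignored words removed."""
    return [t for t in query_terms if t.lower() not in STOP_WORDS]


print(strip_stop_words(["MacBeth", "and", "play"]))
```

With the conjunction removed, the engine is left with the same two terms it would have received from a search for “MacBeth” and “play” alone.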

 

Search Techniques

            There are three general techniques for performing effective searches on the Web: a domain-name search, a specific content search, and a context search.

Domain-Name Search

            The first technique, a domain-name search, isn’t really a search technique at all; it’s a process of educated guessing. This process entails guessing what the name of the site might be, and simply typing that name as the URL. For example, if a user was looking for Pepsi’s Web site, a good educated guess would lead them to type www.pepsi.com, and chances are excellent that this guess would yield good results. If a user was searching for the home page of the White House, a good educated guess might lead them to type in www.whitehouse.gov (remembering that the “gov” domain is reserved for governmental use). This search strategy only works for general topics and businesses or products. Users looking for more specific information should not use this strategy.

Specific Content Search

A specific content search is the more obvious search strategy, and involves using a search engine to find a page that relates to a specific topic. The key to a successful content search is specific and targeted search terms. For example, if a student were looking for information about the Orson Welles movie version of MacBeth, they would want to choose terms that would specifically target relevant sites. Instead of simply typing in “Shakespeare” or “MacBeth” (which would produce a large number of vague or irrelevant hits), the student would type in “Shakespeare,” “MacBeth,” “Orson Welles,” and “Movies.” This list of terms would direct the search engine to find all the pages that have ANY of the terms listed above (and thus increase the total number of hits). When the engine returns the hits to the user, however, the results would be ranked in order of relevance. The first hits would be those pages that included most or all of the terms specified in the search (and sub-categorized according to the criteria of relevance discussed above), so while the total number of hits might be daunting, the first hits listed are more likely to contain the information sought. In many cases, therefore, it is actually beneficial to list as many search terms as possible that relate to the topic in question. Essentially, the user is second-guessing the keyword description of the Web pages supplied by the author or the search engine.
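The way extra search terms enlarge the hit list while improving the top of that list can be sketched as follows. The pages, URLs, and the function name rank_by_terms_matched are hypothetical, and the sketch treats each search term as a single word (so “Welles” stands in for “Orson Welles”).

```python
# Hypothetical page texts used only for illustration.
PAGES = {
    "http://example.edu/welles.html": "Orson Welles directed a Macbeth movie based on Shakespeare",
    "http://example.com/bard.html": "Shakespeare wrote Macbeth and Hamlet",
    "http://example.org/films.html": "Classic movie posters",
    "http://example.net/dogs.html": "Dog breeds",
}


def rank_by_terms_matched(pages, terms):
    """Return every page matching ANY term, best matches first."""
    wanted = {t.lower() for t in terms}
    scored = []
    for url, text in pages.items():
        matched = wanted & set(text.lower().split())
        if matched:
            scored.append((len(matched), url))
    # Most terms matched first; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


RANKED = rank_by_terms_matched(PAGES, ["Shakespeare", "MacBeth", "Welles", "movie"])
print(RANKED)
```

Three of the four pages are returned, but the page that matches all four terms comes first, which is why a longer list of terms tends to help rather than hurt.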

Despite these strategies, however, specific content searches are not always immediately successful. Some pages don’t supply keywords, or choose keywords that do not match users’ expectations. Some search engines will insist on giving relevance to sites that are unrelated to the user’s topic. Some topics are simply not covered on the Internet, or are so rare that the one relevant site will be buried under 3000 irrelevant hits. No user should search through more than the first 50 to 100 hits supplied by a search engine; if the search terms do not yield fairly immediate results, then the user should change or refine the search terms, try a different search engine, or try a context search.

Context Search

The third search strategy is a context search. Using this strategy, users do not look for pages that are directly relevant to their topic, but rather search for pages which might provide links to their topic. This technique is similar to looking through the bibliography of a book only partially relevant to the topic in order to find references to a text which is relevant. Context searches are more economical than specific content searches, because the user is relying upon someone else to have done some of their work for them.

There are two basic versions of a context search. First, users can use search engines such as Yahoo or Snap that pre-categorize topics. Users can browse through the topic listing, and narrow down the search by following the most appropriate path (see chapter 13 for a sample lesson plan using a context search to help students refine paper topics). For example, a user searching for information about MacBeth on Yahoo might begin by clicking on the “Arts and Humanities” category, and then selecting “humanities,” then “literature,” then “authors,” and then search only that category for MacBeth. The resulting hits would be derived only from the Web pages categorized into that section of the database. The second type of context search involves searching for meta-sites; that is, sites that list links to other sites. This kind of search can be accomplished easily by adding terms such as “links,” “list,” “directory,” or “resources” to a content search. For example, if we did a search for “MacBeth,” “Shakespeare,” “resources,” and “links,” our results would list sites that contain links to pages about Shakespeare and MacBeth as well as pages directly concerning MacBeth. This simple procedure makes searching much easier because it allows the user to access existing organizational resources on the Web.

 

We recommend that for most searches, users begin with a content search that includes a variety of keywords. If such a search does not provide immediate results, the user can try to refine their terms, or use the same terms on a different search engine, such as an engine with a directory structure. If these tactics still do not work, the user can add terms such as “resources” or “links” to their keyword list and find a page that might include a link to the information desired.

While blind searching is very easy, effective searching is an acquired skill. The techniques discussed in this chapter should help users browse through the vast amount of information on the Web, but teachers should realize that effective searching is a skill that must be taught.  Teachers should not throw students onto the Web blindly and expect them to find anything useful in a timely fashion. While some students may get lucky and find information quickly, most will not, and may become frustrated. Teachers who want students to locate specific information should either take the time to teach searching skills, or provide several good starting points, such as topic-specific resource pages or directory listings.

 

Evaluating Search Results

            Once users have successfully found Web pages that seem relevant to the topic, they should then evaluate those pages. There are several different criteria that can be used to evaluate an Internet document, and some may apply more or less depending upon the nature of the topic and page in question. The general list of elements to check includes:

·        The page content and design

·        The document URL

·        Attribution (date and author)

·        Outside references/ citations

Page Content and Design

The most obvious place to start an evaluation of any source of information is with the content of the text, which can be evaluated with the same techniques used for printed information. Focus, word choice, editorial correctness, organization, citation method, and tone are components of every text, and each contributes to its impression of credibility (especially when one or more of them seems to be missing). Academic papers posted on the Web will follow the conventions of their respective disciplines for the most part (particularly in this early era of Web document production), as will newspaper articles and public organization documents. Users should pay special attention to clues of bias in the text, as well as the existence of seemingly incongruous material, such as banners or advertisements. This examination of a page’s content corresponds to two categories of traditional evaluation criteria used for printed documents: “accuracy” and “objectivity.”

The design of a page is a bit trickier to evaluate than its content, because design lends few clues about the reliability of the page. Clearly, design does not translate directly into reliability; anyone with some time and money can produce a great looking page about how Abraham Lincoln was in fact a Martian. Conversely, a scholar with little or no experience with Web design might well post a page that violates every rule of page design but which nevertheless contains valuable information.

            The quality of the Web design (whether the site is easy to find and navigate, whether it relies on a lot of “showy” graphics or subordinates form to content, and whether the overall look is appealing) does sometimes correspond to the quality of the content. The logic here would be that only departments or institutions particularly interested in a topic would devote the time and resources necessary to create a decent Web site. Users should be very cautious about the weight they give to this particular criterion.

            One aspect of design that is important to mention concerns navigation. Web pages found through a search engine or a list of hyperlinks can sometimes link a user to a page in the middle of an integrated Web site (in other words, the link takes the user to a page which, viewed by itself, is out of context). In such cases, users may find a “home” button included somewhere in the document that links the user to the main entry page for the site as a whole.

 

URL Evaluation

The URL of the page can tell users quite a bit about the reliability of the information. One clue is whether the server domain name has been reserved for the topic in question. For example, if a student were doing research on the White House, a site at the URL www.whitehouse.gov is probably a fairly reliable place to start looking. As we discussed earlier in this chapter, the “gov” tells us that the server is governmental, so the site probably had miles of red tape to go through before it went public. Users can assume anything from a “gov” site is fairly reliable information. A site at www.whitehouse.com, however, should make users suspicious, because “com” indicates a commercial site.
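Reading the domain ending of a URL can be sketched in Python with the standard urllib.parse module. The function name domain_hint and the lookup table are hypothetical; the table covers only the “gov” and “com” endings discussed in this chapter plus “edu” (educational), which is added here as an assumption for illustration. The result is a hint about the publisher, not a verdict on reliability.

```python
from urllib.parse import urlsplit

# Only a few endings are listed; "edu" is an addition for illustration.
DOMAIN_HINTS = {
    "gov": "governmental",
    "com": "commercial",
    "edu": "educational",
}


def domain_hint(url):
    """Return a rough description of the server based on its domain ending."""
    # urlsplit only recognizes the host if the URL contains "//".
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    suffix = host.rsplit(".", 1)[-1].lower()
    return DOMAIN_HINTS.get(suffix, "unknown")


print(domain_hint("www.whitehouse.gov"))
print(domain_hint("http://www.whitehouse.com/"))
```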

Most sites have longer URLs that contain information about who published the information and where it comes from. The file path section of the URL (see chapter 9 for more information on URLs and file paths) essentially consists of directories and subdirectories, whose names might reveal important information. Take, for example, the following URL: http://www.niu.edu/english/classes/ceh/main.html. From this URL, we can conclude that the file in question is contained in a folder devoted to someone’s English class at Northern Illinois University. We deduce this information by looking at each element of the URL, paying special attention to any logic inherent in the file structure. Here, the nested folders “english”, “classes”, and “ceh” are probably a pretty reliable guide to the information source. These names are in fact arbitrary (folders can be named using any legal characters), but Web administrators usually like to keep things as simple and organized as possible.

Users who are unsure of the relationship between a page and the server on which it is posted can “backtrack” through the URL. “Backtracking” is a process of moving up the virtual hierarchy, whereby the user moves up one folder at a time in an attempt to identify a controlling rubric. Take, for example, the same URL: http://www.niu.edu/english/classes/ceh/main.html. If the relationship between the file and the server were not obvious, we could erase the last part of the URL (“main.html”), and see what the containing folder tells us. We can keep moving up the hierarchy by erasing one folder at a time and see if we get any more information. So, in the example above, we might try http://www.niu.edu/english/classes/ to see other classes with Web pages or http://www.niu.edu/english/ to see what general information we can find about the English department at NIU. Backtracking does not always work, but sometimes it’s the only way to explore the relationship between the server and any contained pages. Note that there is no necessary relationship between the URL and the content of a page. In the example above, it could very well turn out that the page we find at http://www.niu.edu/english/classes/ceh/main.html has nothing at all to do with NIU, English, or courses. Many servers (such as www.geocities.com) are simply public posting areas, and tell you nothing about the source or value of any information published on that server. [1]
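The “backtracking” procedure is mechanical enough to sketch in Python using the standard urllib.parse module. The function name backtrack is hypothetical; the sketch only generates the list of URLs to try, moving up one folder at a time, and does not visit them.

```python
from urllib.parse import urlsplit, urlunsplit


def backtrack(url):
    """List the URLs reached by erasing one path element at a time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    while segments:
        segments.pop()  # erase the last file or folder name
        path = "/" + "/".join(segments)
        if segments:
            path += "/"  # remaining elements are folders
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in backtrack("http://www.niu.edu/english/classes/ceh/main.html"):
    print(candidate)
```

For the example URL, the list runs from the “ceh” folder up through “classes” and “english” to the server's main page. Whether each of those addresses actually returns a page depends on how the server is configured.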

Information derived from a document URL corresponds to the traditional print-based evaluation criterion of “authority” insofar as the URL represents the “publisher” of the page, and the relationship (if any) between the publisher and the author of the text.

 

Attribution (date and author) and Contacts

 

Though in some ways an element of Web page design, the existence of proper attribution on a page is very important for evaluation purposes. Typically, pages will include author and publication information at the bottom of the Web page, and usually, the author’s name is a mailto hyperlink (when users click on the name, an e-mail window opens pre-addressed to that person). One of the best and easiest ways for students to evaluate the content of a page is by e-mailing the author of the page (or Webmaster of the site) and asking where he or she got their information. Most Web authors will reply to such queries.

            Pages that list neither author nor date of publication should be viewed with a healthy amount of skepticism, but students can try to “backtrack” the URL a bit to see if the enclosing folder contains this information. Often, only the “entry” page of a Web site will contain the author’s name and date of publication, and “backtracking” the URL should take the student to the entry page.

            Once students have identified the author of the page (if possible), they should evaluate the author’s credentials in relation to the material at hand. Again, if such credentials do not appear listed in the page itself, students may need to “backtrack” the URL to try to gather more information about the author. (In the URL cited above, students may want to find out about the teacher of the course, and what credentials, if any, might lead readers to believe that the teacher knows the material. If this information is not listed in the page, students could try to “backtrack” to the “classes” folder or even the main “english” folder to try to get more information about the author.)

            The date of publication on the page can also tell the user a bit about the information supplied. Typically, well-designed pages include at least two dates: the date of original publication and the last time the page was updated. Both of these pieces of information are important when evaluating Internet resources. The “last updated” information can tell the reader how current the page is, though users should also read the text itself carefully for such clues (Web authors frequently forget to update the “last updated” section of their page).  The date of publication tells the user how long the page has been up, and can be a useful clue to the reliability of the page. Many pages are extremely transient, and most do not last a year. Pages published a year or more ago, therefore, are probably going to be around for a little bit longer. Note that “backtracking” the URL can be useful here to determine the age of the hosting site. A page on Artificial Intelligence may have a publication date of one week ago (thus conveying an impression of current information), but be part of a site devoted to AI that has been around since 1995 (which, in computer time, is about a century). These elements of attribution correspond to two of the evaluation criteria traditionally used for print documents: “authority” (to determine the qualifications of the author of the text) and “currency” (to determine the timely relevance of the information).

 

Outside references/ citations

            Just as print documents often contain links to outside resources in the form of bibliographies and footnotes, good Web pages contain hyperlinks to related resources. Such links confirm that the information included in the page is part of a network of resources, and users should visit at least some of these other sites, as well as related print documents, to compare the quality and accuracy of information. Likewise, users who find the same site being referenced numerous times may wish to visit that site.

            One important note about “outside references” should be mentioned here. One of the traditional criteria for evaluating print documents concerns “coverage”; that is, the amount and quality of information included about the topic. While appropriate for evaluating printed information, this criterion is less useful for evaluating Internet resources because of the nature of the medium. Whereas printed texts tend to be somewhat self-enclosed and holistic (we even say that pages are BOUND into a book), hypertext documents are inherently open. By definition, a hypertext is an electronic document that includes direct links to other documents. When posted on a server, these links are active, and alter the nature of the reading experience, which becomes less vertical and more lateral. Indeed, Web pages that attempt to act like printed texts (by attempting to include “deep” coverage of a topic on a single page) are examples of bad hypertext design. When we speak of “coverage,” then, we need to take hypertextuality into account. The complete “coverage” of a topic might span many Web pages, and include many different authors, each providing some small part or step of the whole. This is not to suggest we ignore this criterion when evaluating Web pages; it is to suggest that students need to consider hyperlinks to outside resources as part of the “coverage.” For example, let us assume that we are reading a page describing recent medical applications of leeches. As we are reading the essay, we notice that the word “leech” is a hyperlink, and when we click on it, we are taken to a separate page (different author, different publisher) that contains detailed biological descriptions of leeches. While the author of the original page did not write this outside resource, he or she was resourceful enough to find and include a link to that detailed information as a component of the essay. The immediacy of the hypertext changes the concept of “coverage” because the boundaries between distinct “texts” are all but gone.

 

Selected Search Engine and Internet Evaluation Resources:

 

 

Abilock, Debbie. Choose the best search engine for your information needs. The Nueva School. August 8, 1996, Latest Revision: 1/19/99. http://www.nueva.pvt.k12.ca.us/~debbie/library/research/adviceengine.html (February, 1999).

 

Barlow, Linda. The Spider’s Apprentice: Tips on Searching the Web. Monash Information Services. Updated 10 Nov 1998. http://www.monash.com/spidap.html (February, 1999).

 

Beck, Susan E. The Good, The Bad and The Ugly, or, Why It's a Good Idea to Evaluate Web Sources. New Mexico State University Library. 1997, Last updated on 10/14/98. http://lib.nmsu.edu/staff/susabeck/eval.html (February, 1999).

 

Ciolek, Dr T.Matthew and Goltz, Irena M.  Information Quality WWW Virtual Library: The Internet Guide to Construction of Quality Online Resources. WWW.CIOLEK.COM Asia Pacific Research Online. 15 Mar 1996, last updated: 26 Jan 1999. http://www.ciolek.com/WWWVL-InfoQuality.html (February, 1999).

 

Cohen, Laura and Jacobson, Trudi. Evaluating Internet Resources. University at Albany Libraries. 4/96. http://www.albany.edu/library/Internet/evaluate.html (February, 1999).

 

Cosgrave, Tony, Engle, Michael, and Ormondroyd, Joan. How to Critically Analyze Information Sources. Reference Services Division, Olin*Kroch*Uris Libraries, Cornell University Library. Revised October 20, 1996. http://www.library.cornell.edu/okuref/research/skill26.htm (February, 1999).

 

Grassian, Esther. Thinking Critically about World Wide Web Resources.  UCLA College Library. 6/95, last updated 10/98. http://www.library.ucla.edu/libraries/college/instruct/Web/critical.htm (February, 1999).

 

Harris, Robert. Evaluating Internet Research Sources. Southern California College. Version Date: November 17, 1997. http://www.sccu.edu/faculty/R_Harris/evalu8it.htm (February, 1999).

 

Hinchliffe, Lisa Janicke. Resource Selection and Information Evaluation. The Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign. 1994, updated May 29, 1997. http://alexia.lis.uiuc.edu/~janicke/Evaluate.html (February, 1999).

 

Richmond, Betsy. Ten C's for Evaluating Internet Resources. University of Wisconsin-Eau Claire, McIntyre Library. Updated: November 20, 1996. http://www.uwec.edu/Admin/Library/10cs.html (February, 1999).

 

Stegall, Nancy L. Using Cybersources. DeVry Institute of Technology, Online Writing Support Center. Last update: 08/31/98. http://www.devry-phx.edu/lrnresrc/dowsc/integrty.htm (February, 1999).

 

Tillman, Hope N. Evaluating Quality on the Net. TIAC: The Internet Access Company, Inc. November 16, 1998, last revised 2 January 1999. http://www.tiac.net/users/hope/findqual.html (February, 1999).



[1] Because of the importance of this evaluation criterion, we require our students to include information about the relationship between the server and page in the citation page. More specifically, we use the Columbia style citation method (available at http://www.columbia.edu/cu/cup/cgos/idx_basic.html), but have students include information about the hosting server after the name of the site (and collected work, if applicable), exactly as traditional MLA citations include information about the publisher after the book title. See our selected bibliography for examples.