1
Search Engines De-Mystified

  • Martha Talman, Instruction Librarian
  • Dixie State College of Utah



  • Utah Library Association Conference
  • St. George, Utah          May 2006
2
 What We’ll Cover Today
  • What is a search engine?
  • What does a search engine do?
  • How does a search engine work?
  • What search techniques can help?
  • Are there any secret tricks?
  • How do I access the invisible web?
  • How are search engines different?
  • Which search engine is the best?





3
   What is a search engine?
  • A search engine is a program that collects web pages and allows keyword searching of those records.
  • Search engines do not search the web.  A search engine searches its database of collected web pages.
  • A search engine is a searchable online database of Internet resources.
  • A search engine has more in common with a library catalog or library database than it has differences.



4
What does a search engine do?
  • “A search engine is a searchable online database of Internet resources. It has several components: search engine software, spider software, an index (database), and a relevancy algorithm (rules for ranking). The search engine software consists of a server or a collection of servers dedicated to indexing Internet Web pages, storing the results and returning lists of pages to match user queries. The spidering software constantly crawls the Web collecting Web page data for the index. The index is a database for storing the data. The relevancy algorithm determines how to rank queries.”
  • “SEO Glossary” http://www.virtechseo.com/seoglossary.htm
5
"Internet search engines were available..."
  • Internet search engines were available before the World Wide Web
  • Just as a web browser connects you to the World Wide Web, a search engine directs you to actual documents on the WWW.
  • No information or documents are housed in a search engine (except “cached pages”) or “on the Internet.”  Information is found on individual computers called servers.
6
"Search engines use many connected..."
  • Search engines use many connected small computers (similar to a desktop computer) to search records. Estimates are that Google uses between 45,000 and 300,000 clustered computers to house its database.
  • Searching in these clustered or linked computers is incredibly fast. Most Google searches are returned in under one second. If a search engine were to actually go out to the World Wide Web and look for web pages matching your search terms – as opposed to looking at representations of web pages in the search engine index – it would take hours or even days to return results.
7
   Eeek!  Spider!
  • A search engine builds its database by using a program called a “spider” or a “robot”.
  • A spider can also be referred to as a robot, crawler, or wanderer. Each search engine uses a spider to build its database.
  • A search engine spider is a computer program that travels the Internet (link to link) locating and retrieving Web documents.
  • It indexes the documents in a database, which is then searched using a search engine.
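  • The spider-and-index idea above can be sketched in a few lines. This is only an illustration: the four-page "web" (TOY_WEB), its .example addresses, and the function names are invented for the sketch, and real spiders are far more elaborate.

```python
from collections import defaultdict, deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("utah library conference", ["b.example", "c.example"]),
    "b.example": ("library catalog search", ["a.example"]),
    "c.example": ("search engine spider", []),
    "d.example": ("unlinked page about spiders", []),  # no page links to this one
}

def crawl_and_index(start_url):
    """Travel link to link from start_url and build a keyword -> URLs index."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        for word in text.split():
            index[word].add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, keyword):
    """Look up a keyword in the index; the "web" itself is never consulted."""
    return sorted(index.get(keyword, set()))

index = crawl_and_index("a.example")
print(search(index, "library"))  # the two pages that mention "library"
print(search(index, "spiders"))  # empty list: d.example was never reached
```

  • Note that the search only reads the index, and that the page no one links to never makes it into the index at all.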
8
Each search engine is unique
  • Search engines are built on the same information retrieval methods as online catalogs and databases.
  • Each search engine’s spider works differently thereby creating a unique database of web documents that varies in size, content, and indexing methods (including how much of the page is indexed), as well as frequency of indexing
  • The first search engines created huge indices of web pages (cached pages)


9
"In response to a query..."
  • In response to a query (keyword), the search engine returned a list of results based on keyword occurrence and proximity
  • While this method works well with the structured records in catalogs and databases, it is not effective when dealing with huge numbers of unstructured documents on the Web
  • Search engine effectiveness is not based solely on number (how much of the Web is being accessed by the spider) but also how the results are determined and displayed
10
   Relevancy Algorithm
  • Google changed everything with PageRank and link analysis
  • Although each search engine’s algorithm is secret, we know that rankings (display order) are different and that they change frequently
  • Some factors in relevancy ranking include
    • Location of keyword (meta tags, title)
    • Frequency of keyword
    • Relation of keywords
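  • A rough sketch of how keyword location and frequency could feed a score. The weights (3 for a title hit, 1 for a body hit), the sample pages, and the function name are assumptions made for illustration; actual search engine algorithms are secret.

```python
def relevancy_score(page, keyword, title_weight=3.0, body_weight=1.0):
    """Score a page by where (title or body) and how often the keyword appears."""
    keyword = keyword.lower()
    title_hits = page["title"].lower().split().count(keyword)
    body_hits = page["body"].lower().split().count(keyword)
    return title_weight * title_hits + body_weight * body_hits

# Hypothetical pages, invented for this example.
pages = [
    {"url": "two.example", "title": "Card games", "body": "poker is one of many card games"},
    {"url": "one.example", "title": "Poker rules", "body": "rules of poker and poker odds"},
]

# Highest score first, as on a results list.
ranked = sorted(pages, key=lambda p: relevancy_score(p, "poker"), reverse=True)
for page in ranked:
    print(page["url"], relevancy_score(page, "poker"))
```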



11
   Off-the-Page Ranking Criteria
  • Link Analysis
    • By analyzing how pages link to each other, a search engine can determine both what a page is about and whether that page is deemed "important" and thus deserving of a ranking boost. Also called popularity.
  • Search engines have to build in protection against spamming (repeating keywords) and artificial links
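  • Link analysis can be illustrated with the simplified, textbook form of a PageRank-style calculation. This is a sketch only, not Google's production algorithm (which is secret); the damping value of 0.85, the toy link graph, and the function name are assumptions.

```python
def link_popularity(links, damping=0.85, iterations=50):
    """Simplified PageRank-style iteration over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site: three pages link to "home", nothing links to "orphan".
toy_links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}
ranks = link_popularity(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

  • The page with the most incoming links ends up with the highest score, and the page nobody links to ends up with the lowest.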
12
All searching is local?
  • Google has search criteria that consider the location of the searcher.  The search “westminster college” produces differently ranked (displayed) results depending on whether the searcher is in Utah, Missouri, Pennsylvania, London, or Sydney, which all house colleges called Westminster.
  • How is this done?  We’re not sure, but when Google applied for a patent for ProfileRank, a spin-off of PageRank, they included the following information:
    • “…different users can be automatically recognized based on the items they access or other characteristics of their access patterns [even if they don’t log in or provide a profile].”
    • “The relevance of a site to user can be accurately characterized by a set of profile ranks, based on the correlation between a sites content and the user’s term-based profile.”
13
Other ranking factors (in no specific order)
  • Language
  • Age of document (could be countered by links)
  • Document length (shorter = better)
  • Anchor text of links
  • Uniqueness
  • Change: Along with measuring historical and current rates of change, search engines watch for the amount of content that has changed in a document over the course of updates and which specific pieces of content on a web page have experienced change.
  • Quality and Authority (spelling, grammar?)



14
Wonder why?
  • In Google, two methods highlight your search terms on a page, which can help a user determine relevance (and also satisfy a librarian’s never-ending curiosity as to why that strange web page appeared in the search results)
    • cache:www.ula.org conference
    • Click the “Cached” link on a results list
  • The word “conference” will be highlighted on that page
15
And
  • Default Search
    • The default search in all search engines is “AND”.  This means that the logical operator AND is automatically inserted between keywords.
      • greenhouse effect = greenhouse AND effect
  • To combat this, use phrase searching
    • Put phrases in double quotation marks
    • Phrase searching keeps words together in that order
      • “greenhouse effect” = greenhouse effect
      • “greenhouse effect” kyoto = greenhouse effect AND kyoto
  • “Phrase searching” is the essential tool in web searching!
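  • The difference between the default AND search and a phrase search can be sketched as follows. The function names and sample sentences are invented for illustration, and matching is done on plain lowercase words only.

```python
def matches_and(text, query):
    """Default search: every keyword must appear somewhere in the text."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())

def matches_phrase(text, phrase):
    """Phrase search: the words must appear together, in the given order."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

doc_a = "the greenhouse effect and the kyoto protocol"
doc_b = "the effect of paint on a greenhouse"

print(matches_and(doc_a, "greenhouse effect"))     # both words present
print(matches_and(doc_b, "greenhouse effect"))     # both words present, but far apart
print(matches_phrase(doc_a, "greenhouse effect"))  # words adjacent and in order
print(matches_phrase(doc_b, "greenhouse effect"))  # not adjacent, so no match
```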


16
“Fuzzy And”
  • “Fuzzy and” refers to the display of results in search engines. It is used by Google, but Yahoo! says it doesn’t use “fuzzy and”.
  • The default search places the Boolean operator AND between keywords, and documents containing all terms (Boolean AND) are ranked first. They can, however, be followed by documents containing only some of the terms (Boolean OR). The farther down the results list, the fewer terms match, although at least one should always be present.
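  • A minimal sketch of the “fuzzy and” display order: documents with every term first, then documents with fewer terms, and nothing with zero terms. The sample documents and the function name are assumptions; how Google actually orders such results is not public.

```python
def fuzzy_and_rank(docs, query):
    """Order documents by how many distinct query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for name, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched > 0:  # at least one term must always be present
            scored.append((matched, name))
    # Most matched terms first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

# Hypothetical documents, named after how many query terms they contain.
docs = {
    "one_term": "dogs at the park",
    "no_terms": "library conference schedule",
    "all_terms": "dogs have dietary needs",
    "two_terms": "dietary needs of cats",
}
print(fuzzy_and_rank(docs, "dogs dietary needs"))
```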
17
   Google and Stemming
   (stem, stemmed, stemmer, etc.)
  • Google uses stemming technology. Thus, when it deems it appropriate, Google will search not only for your search terms, but also for words that are similar to some or all of those terms.
  • If you search for dogs dietary needs, Google will also search for dogs diet needs, dog dietary needs, dog diets need and other related variations of your terms.
  • Stemming is unpredictable
  • Stemming does not work with single word searches or within “phrase searches”
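  • To give a feel for stemming, here is a deliberately crude suffix-stripping sketch. The suffix list, the minimum stem length, and the function names are assumptions chosen so the dogs dietary needs example works; Google's real stemming rules are not published and are certainly more sophisticated.

```python
SUFFIXES = ("ary", "ing", "ed", "s")  # illustrative list only

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, as long as at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def same_stems(query_a, query_b):
    """True when two queries reduce to the same sequence of stems."""
    stems_a = [naive_stem(w) for w in query_a.split()]
    stems_b = [naive_stem(w) for w in query_b.split()]
    return stems_a == stems_b

print([naive_stem(w) for w in "dogs dietary needs".split()])
print(same_stems("dogs dietary needs", "dog diets need"))
```

  • Even this toy version shows why stemming is unpredictable: whether a word is trimmed depends entirely on rules the searcher cannot see.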
18
   Stop the Stemming
  • There may be searches where stemming is counterproductive.
  • Danny Sullivan of Search Engine Watch reports that placing the plus ( + ) symbol in front of a word stops Google from stemming that word.
  • Example: dogs +dietary needs
    • Dogs and needs will still be stemmed, dietary will not
  • All symbols in search engines should be preceded by a space but followed directly by the keyword
19
   Boolean Operator OR
  • All the “big” search engines support the logical operator OR.  OR means any one of these equivalent terms should appear in results
    • When searching for information on bilingual schools
      • School = education = classroom
      • bilingual school OR education OR classroom
      • Bilingual = dual language
      • Bilingual OR dual language school OR education OR classroom
      • Bilingual OR dual language AND school OR education OR  classroom
  • In Google, a string of OR terms can also be replaced by the tilde ( ~ ) placed in front of a keyword, which tells Google to do a synonym search.
      • Bilingual  ~classroom = bilingual education


20
   NOT so fast!
  • Google performs the function of the logical operator NOT by using the minus ( - ) symbol.
    • “nuclear waste” -medical
  • Yahoo! supports all logical operators (AND, OR, NOT, AND NOT if capitalized)
  • As with Google and OR, logical operators must be capitalized in Yahoo!
  • Yahoo! also supports the minus ( - ) sign for NOT and the plus ( + ) sign for AND.
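  • The logical operators map neatly onto set operations over an index. In this sketch the toy index (keyword to document numbers) and the function name are invented for illustration.

```python
# Hypothetical index: keyword -> set of document numbers containing it.
toy_index = {
    "bilingual": {1, 2, 3},
    "school": {1, 4},
    "education": {2, 5},
    "classroom": {3, 6},
    "medical": {2},
}

def docs_for(term):
    """Return the documents containing a term (empty set if the term is unknown)."""
    return toy_index.get(term, set())

# school OR education OR classroom: any one of the equivalent terms may appear.
any_setting = docs_for("school") | docs_for("education") | docs_for("classroom")

# bilingual AND (school OR education OR classroom): both sides must match.
bilingual_settings = docs_for("bilingual") & any_setting

# ... NOT medical: remove every document that mentions the unwanted term.
without_medical = bilingual_settings - docs_for("medical")

print(sorted(any_setting))
print(sorted(bilingual_settings))
print(sorted(without_medical))
```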
21
Stop Words
  • Stop words are terms the database (or in this case, search engine) ignores in a search
  • Stop words commonly include articles (a, an, the); prepositions (of, at, in, on, to, etc.); interrogatives (who, what, where, when, etc.); and pronouns (it, he, she, we, they, etc.)
  • Stop words are usually not stored in the search engine’s index.  They are replaced by a marker ( * ).
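  • A small sketch of stop words being swapped for a marker before indexing. The stop word list is a sample drawn from the categories above, plus a lowercase "and" as an assumption; each search engine keeps its own list, and the function name is invented.

```python
# Sample stop word list (an assumption; real lists differ by search engine).
STOP_WORDS = {
    "a", "an", "the",                      # articles
    "of", "at", "in", "on", "to",          # prepositions
    "who", "what", "where", "when",        # interrogatives
    "it", "he", "she", "we", "they",       # pronouns
    "and",                                 # assumed: uncapitalized operator
}

def index_terms(text):
    """Replace each stop word with the marker '*' and keep every other word."""
    return ["*" if word in STOP_WORDS else word for word in text.lower().split()]

print(index_terms("of mice and men"))
print(index_terms("what not to wear"))
```

  • With the stop words gone, a famous title is reduced to a couple of ordinary words, which is why such searches can misfire.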
22
Forcing Stop Words
  • Some search engines hint they do search a stop word if it is included in a “phrase search.”  Google does not.  Uncapitalized and unrecognized logical operators are considered stop words.
  • There are many times when searching a stop word as part of a search can be important (book titles, quotes, phrases, etc.).
  • To force the search of a stop word, place the plus ( + ) sign in front of it.
    • “+of mice +and men”
    • “+what +not +to wear”
    • “failure +is +not +an option”
23
Google: Searching by Metawords, Advanced Operators & Shortcuts
  • Notice spacing – it sometimes matters
    • define:   Definitions                   define:cookie
    • related:  Similar sites           related:www.dixie.edu
    • link: Sites that link to this site          link:www.ula.org
    • stocks:   Search ticker symbol(s)            stocks: zion wfc
    • site:   Searches word within site    library site:www.dixie.edu
    • inurl:     Searches word in URL only       inurl:birkenstock
    • allinurl:  Multiple words                  allinurl: digital reference
    • intitle:    Searches word in title only         intitle:dixie
    • allintitle: Multiple words                   allintitle: val browning
    • calculator

24
Yahoo:  Searching by Metawords, Advanced Operators & Shortcuts
  • Area code: area code miami
  • ZIP code: zip code salt lake city
  • Synonym (#4): synonym taboo
  • Facts: spider facts
  • Exchange rates: convert euro dollar
  • Movie showtimes: showtimes provo
  • News: news hurricane katrina
  • Patent: patent 2612994
  • Quote: quote wfc
  • Scores: scores yankees
  • Business name & location: pizza hut st. george
  • Basic mathematical calculations. You can add (+), subtract (-), multiply (*) and divide (/). 17 * 4 + 856 / 2 = 496
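  • The calculator example depends on the usual order of operations (multiply and divide before adding). The check below is plain Python arithmetic, assuming the search engine's calculator follows the same precedence rules.

```python
product = 17 * 4      # multiplication is done first
quotient = 856 / 2    # division is done before the addition as well
total = 17 * 4 + 856 / 2

# Working strictly left to right would give a different answer.
left_to_right = ((17 * 4) + 856) / 2

print(product, quotient, total)
print(left_to_right)
```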
25
Limiting in Google Advanced Search
  • Domain
    • Word within specific site
    • By domain extension .edu  .gov   .mil  .com  .net   .org
      • Only within  jeter  site:.org
      • Not within   jeter -site:.com
  • Keyword occurrence
    • In title
    • In text
    • In url
  • Date updated (only in advanced search)
    • 3 months, 6 months, past year


26
Limiting in Google Advanced Search
  • Language (only in advanced search)
    • Drop-down list
  • File format
    • Only return results of the chosen file format
    • Don’t return results of the chosen file format
    • .pdf   .ps   .doc   .xls   .ppt   .rtf
    • refworks filetype:ppt


27
Limiting By Domain
  • Search poker
    • Results = dozens of sites to play poker
  • Search poker site:.edu
    • Results much more varied
      • Statistics of poker
      • Probabilities of poker
      • Rules of poker
      • History of the World Series of Poker
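  • Limiting by domain amounts to filtering results on the end of the host name. A sketch, with made-up .example addresses and invented function names:

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the host name part of a URL (empty string if there is none)."""
    return urlparse(url).hostname or ""

def only_within(urls, suffix):
    """Like site:.edu, keep only results whose host ends with the suffix."""
    return [u for u in urls if host_of(u).endswith(suffix)]

def not_within(urls, suffix):
    """Like -site:.com, drop results whose host ends with the suffix."""
    return [u for u in urls if not host_of(u).endswith(suffix)]

# Hypothetical result list for the search "poker".
results = [
    "http://math.example.edu/poker-odds",
    "http://play.example.com/poker",
    "http://history.example.org/poker",
]
print(only_within(results, ".edu"))
print(not_within(results, ".com"))
```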



28
Wildcard Searching in Google
  • Use the asterisk to stand in for a word within a search in quotes or without quotes
    • Search: the * of the thing
      • the principle of the thing
      • the meaning of the thing
      • the spirit of the thing
      • the essence of the thing
      • the gist of the thing
      • the nature of the thing
29
Proximity Searching in Google
  • Use the asterisk to stand in for intervening terms (placeholder)
    • “global warming * kyoto” (within one word)
      • Global Warming Summit in Kyoto
    • “kyoto * global warming” (within one word)
      • Kyoto Protocol Global Warming Treaty
    • yankees * *  home runs  (within two words)
      • Yankees hit 223 home runs
      • Yankees hit the winning home run
    • librarians * * * patriot act (within three words)
      • Librarians on the Patriot Act (0)
      • Librarian Fights Patriot Act (1)
      • Libraries Chafe under USA Patriot Act (2)
      • Librarians’ Assembly Considers USA Patriot Act (3)
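  • The placeholder idea can be imitated with a regular expression. This sketch assumes each asterisk allows zero or one intervening word, which is one reading of "within N words"; it does not model stemming or stop words, and it is not how Google itself is implemented. The function name is invented.

```python
import re

WORD_GAP = r"\s+"

def within_n_words(query, text):
    """Match a query where each * stands for zero or one intervening word."""
    pattern = ""
    for token in query.lower().split():
        if token == "*":
            pattern += r"(?:\S+\s+)?"  # optional placeholder word
        else:
            pattern += re.escape(token) + WORD_GAP
    # Drop the trailing gap left behind by the final keyword.
    if pattern.endswith(WORD_GAP):
        pattern = pattern[: -len(WORD_GAP)]
    return re.search(r"\b" + pattern + r"\b", text.lower()) is not None

print(within_n_words("kyoto * global warming", "Kyoto Protocol Global Warming Treaty"))
print(within_n_words("yankees * * home runs", "Yankees hit 223 home runs"))
print(within_n_words("yankees * * home runs", "Yankees pitchers rarely ever allow home runs"))
```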



30
Numerical Searches
  • ISBNs = enter number  09450612X
  • Patents = patent 2135194
  • UPC Codes = enter number 073333531084
  • VINs = enter number AAAAA999A9AA99999
  • FedEx, UPS, USPS tracking numbers = enter number
  • Date range
    • Example: 1903..1914 advertising

31
Qualities of Search Engines
  • Content
    • Size
      • Comprehensiveness
      • % of available web
    • Currency
      • rate of indexing
      • depth of re-indexing
    • Completeness
      • percent of documents included
    • Types of documents
      • especially full-text searching of non-html
    • Speed of searches
    • Consistency of results


32
Qualities of Search Engines
  • Capabilities
    • Basic search options
      • Default search
      • Phrase searching
      • Equivalent terms
        • single words or phrases?
    • Advanced search options
      • Field searching
      • Limit by domain
        • Single domain or more than one?
      • Limit by type of document
      • Limit by language
      • Limit by date updated


33
Qualities of Search Engines
  • Capabilities (continued)
    • Forcing stop words
    • Limit on number of search terms
      • previously 10 in Google, now 32
    • Complexity limits
    • Search within previous results
    • Customize results or customize search
    • Filters
    • Personalization (preferences)



34
Qualities of Search Engines
  • Results Display
    • Ranking
      • Popularity?  Relevancy?
      • Default to adjacency?
      • “Fuzzy and”?
    • Display
      • Highlighted keywords
      • Excerpt display



35
Size Matters
  • Google is bigger, much bigger, than any other search engine
    • Search library in MSN =          98,841,330
    • Search library in Ask.com =  137,410,000
    • Search library in Yahoo! =     686,000,000
    • Search library in Google =  3,150,000,000
  • Still, the results represent only a portion of the World Wide Web, which is only a portion of the Internet. Current estimates are that Google indexes 50% of the visible web.



36
Today, Google is the best
  • PageRank is unique to Google
    • Popularity
    • Importance (traffic, quality of links)
    • Word proximity
    • Word frequency and position
  • Limits
  • Fuzzy searches (synonyms, stemming)
  • Shortcuts
  • Unique databases





37
Problems in Google
  • Number of stop words and clumsiness in forcing stop words
  • Yahoo! has full Boolean logic with nesting!
  • “Fuzzy and” (changing to OR within search results)
  • “Importance” can override other relevancy factors


38
One is not enough
  • Why should you use multiple search engines?  Didn’t we just decide Google is the best?
  • Jux2:  http://www.jux2.com/
  • There is little overlap in search engine results
    • Only 3.1 – 3.5 of the top 10 results are the same when comparing 500 popular searches in Google, Yahoo!, and MSN.
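  • An overlap figure like this is simple to compute for any two results lists. The lists below are invented placeholders, not real search results, and the function name is an assumption.

```python
def top_n_overlap(results_a, results_b, n=10):
    """Count how many of the first n results two search engines have in common."""
    return len(set(results_a[:n]) & set(results_b[:n]))

# Hypothetical top-10 lists from two engines (site1..site10 and site8..site17).
engine_a = [f"site{i}.example" for i in range(1, 11)]
engine_b = [f"site{i}.example" for i in range(8, 18)]

shared = top_n_overlap(engine_a, engine_b)
print(shared, "of the top 10 results are the same")
```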
39
But wait, there’s more!
  • According to Jux2, “In a separate test of 91 random searches, we found that Google and Yahoo! share only 23% of their top 100 results. Furthermore, only 4.8 of Google's top 10 results even made Yahoo!'s top 100 (conversely, only 5.4 of Yahoo!'s top 10 made Google's top 100).”
  • Yikes!
40
Pourquoi la différence?
  • Disparity is due to
    • Different databases of web pages
    • Different ranking systems
    • Different indexing methods and rates
  • Think of search engines as stores with
    • Different merchandise
    • Different brands
    • Different service
    • Different prices
  • Shop around!


41
Metasearch or not?  Just say no.
  • Metasearch engines are search engines that query other search engines and then combine all the results received. In effect, the user is not using just one search engine, but a combination of many search engines at once, supposedly to optimize searching.
  • However, you lose all the functionality we’ve just learned!!!
  • Plus, the metasearch engine ranks and displays results using its own algorithm, which probably isn’t as good as the individual search engines’ systems.
  • You don’t really need more results, just better results.
42
Beyond the search engine
(or where spiders dare not tread)
  • The Invisible Web (also called the deep or hidden web) is the name given to that portion of the World Wide Web that is not indexed by search engines.
    • Current estimates are that Google, by far the largest search engine, only accesses 50% of the visible web
  • It consists of pages which are not linked to by other pages, such as dynamic web pages based on responses to database queries.
  • It also includes sites that require registration or otherwise limit access to their pages.
43
How to see the invisible
  • Directories: spider-free, built by humans.
    • Keyword search or browse
    • Search engine subject directories
      • Google Directory (4 million pages)
        • http://www.google.com/dirhp
      • Yahoo! Directory (2 million pages)
        • http://dir.yahoo.com/
    • LII: Librarians’ Index to the Internet (excellent annotations)
      • Only the best sites, compiled by public librarians
        • http://www.lii.org/
44
More Directories
    • INFOMINE (excellent annotations)
      • Great sites from academic librarians
        • http://infomine.ucr.edu/
    • AcademicInfo (browsing recommended over keyword search)
      • http://www.academicinfo.net/
    • IPL: Internet Public Library
      • http://www.ipl.org/
    • Digital Librarian
      • http://www.digital-librarian.com/




45
Seeing the Invisible
  • Proprietary databases
    • Pioneer, Utah’s Online Library
    • Individual resources purchased by academic, public, and school libraries
  • Large databases
    • Firstgov http://www.firstgov.gov/
    • American Memory http://memory.loc.gov/ammem/index.html
  • Specialized databases
    • Find in directories
    • Access through library research guides
    • Locate in search engines (search chemicals database)

46
Know your local resources
  • Mountain West Digital Library
    • http://www.lib.utah.edu/digital/mwdl/
  • Utah Digital Newspapers
    • http://www.lib.utah.edu/digital/unews/
  • Utah.gov
    • http://www.utah.gov/
  • Utah Collections Multimedia Encyclopedia
    • http://www.uen.org/ucme/
47
More Google-ing
  • Google Images
    • http://images.google.com/imghp?hl=en
  • Google Scholar
    • http://scholar.google.com/
  • Froogle
    • http://froogle.google.com/frghp?hl=en&tab=wf&q=
  • Google Earth
    • http://earth.google.com/
  • Google Blog Search
    • http://blogsearch.google.com/
  • Google News
    • http://news.google.com/nwshp?hl=en&tab=fn&ned=us&q=


48
Keep Learning
  • Search engine technology is one of the fastest-changing fields in an arena of incredibly fast evolution.  To stay current:
    • Learn the search techniques
    • Keep yourself informed
    • Try different search engines
    • Ask others their “tricks”
    • Experiment
  • Subscribe to the email “Google Newsletter for Librarians”
  • http://www.google.com/librariancenter/index.html
49
Sources
  • UC Berkeley Library http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
  • Search Engine Watch
    http://searchenginewatch.com/
  • Jux2: Compare search engines
    • http://www.jux2.com/
  • Search Engines
    • Advanced Search pages
    • Search Tips / Techniques


50
Handouts & Web Posting
  • The “Googling to the Max” handouts are from the UC Berkeley Teaching Library, an excellent source of information about searching the Internet
    • http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Google.html
  • This presentation will be posted on the DSC Browning Library Web pages under New @ Your Library
    • http://library.dixie.edu
51
Presenter Information
  • Martha Talman
    Instruction Librarian
    Val A. Browning Library
    Dixie State College of Utah
    St. George UT  84770
    435.652.7722
    talman@dixie.edu