|
1
|
- Martha Talman, Instruction Librarian
- Dixie State College of Utah
- Utah Library Association Conference
- St. George, Utah May 2006
|
|
2
|
- What is a search engine?
- What does a search engine do?
- How does a search engine work?
- What search techniques can help?
- Are there any secret tricks?
- How do I access the invisible web?
- How are search engines different?
- Which search engine is the best?
|
|
3
|
- A search engine is a program that collects web pages and allows keyword
searching of those records.
- Search engines do not search the web.
A search engine searches its database of collected web pages.
- A search engine is a searchable online database of Internet resources.
- A search engine is more similar to a library catalog or library database
than it is dissimilar.
|
|
4
|
- “A search engine is a searchable online database of Internet resources.
It has several components: search engine software, spider software, an
index (database), and a relevancy algorithm (rules for ranking). The search
engine software consists of a server or a collection of servers
dedicated to indexing Internet Web pages, storing the results and
returning lists of pages to match user queries. The spidering software
constantly crawls the Web collecting Web page data for the index. The index
is a database for storing the data. The relevancy algorithm determines
how to rank queries.”
- “SEO Glossary” http://www.virtechseo.com/seoglossary.htm
|
|
5
|
- Internet search engines were available before the World Wide Web
- Just as a web browser connects you to the World Wide Web, a search
engine directs you to actual documents on the WWW.
- No information or documents are housed in a search engine (except
“cached pages”) or “on the Internet.”
Information is found on individual computers called servers.
|
|
6
|
- Search engines use many connected small computers (similar to a desktop
computer) to search records. Estimates are that Google uses between
45,000 and 300,000 clustered computers to house its database.
- Searching in these clustered or linked computers is incredibly fast.
Most Google searches are returned in under one second. If a search
engine were to actually go out to the World Wide Web and look for web
pages matching your search terms – as opposed to looking at
representations of web pages in the search engine index – it would take
hours or even days to return results.
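The speed difference described above comes from searching a prebuilt index rather than the pages themselves. A minimal sketch, assuming a tiny in-memory collection (the URLs, page text, and function names are hypothetical, not any search engine's actual design):

```python
from collections import defaultdict

# Hypothetical collected pages (URL -> text); a real index holds billions.
PAGES = {
    "a.example/greenhouse": "the greenhouse effect warms the planet",
    "b.example/garden": "build a greenhouse in your garden",
    "c.example/kyoto": "kyoto treaty targets the greenhouse effect",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(PAGES)  # built once, ahead of time
print(sorted(search(INDEX, "greenhouse effect")))
```

The index is built once; each query is then a handful of dictionary lookups and set intersections, with no page fetched at search time.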
|
|
7
|
- A search engine builds its database by using a program called a “spider”
or a “robot”.
- A spider can also be referred to as a robot, crawler, or wanderer. Each
search engine uses a spider to build its database.
- A search engine spider is a computer program that travels the Internet
(link to link) locating and retrieving Web documents.
- It indexes the documents in a database, which is then searched using a
search engine.
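The link-to-link travel described above can be sketched as a breadth-first walk. This is an illustration only: the miniature "web" below is a hypothetical dictionary, and a real spider would fetch pages over the network and respect each site's robot rules.

```python
from collections import deque

# Hypothetical miniature web: each URL maps to (page text, outgoing links).
WEB = {
    "home": ("welcome to the library", ["catalog", "hours"]),
    "catalog": ("search the catalog", ["home", "databases"]),
    "hours": ("open nine to five", []),
    "databases": ("article databases", ["catalog"]),
    "orphan": ("no page links here", []),
}

def crawl(start, web):
    """Follow links breadth-first from start; return {url: text} for pages reached."""
    collected = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in collected or url not in web:
            continue  # already visited, or a dead link
        text, links = web[url]
        collected[url] = text
        queue.extend(links)
    return collected

pages = crawl("home", WEB)
print(sorted(pages))
```

The page named "orphan" is never collected because nothing links to it, which is one way a page ends up outside a search engine's database.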
|
|
8
|
- Search engines are built on the same information retrieval methods as
online catalogs and databases.
- Each search engine’s spider works differently thereby creating a unique
database of web documents that varies in size, content, and indexing
methods (including how much of the page is indexed), as well as
frequency of indexing
- The first search engines created huge indices of web pages (cached
pages)
|
|
9
|
- In response to a query (keyword), the search engine returned a list of
results based on keyword occurrence and proximity
- While this method works well with the structured records in catalogs and
databases, it is not effective when dealing with huge numbers of
unstructured documents on the Web
- Search engine effectiveness is not based solely on number (how much of
the Web is being accessed by the spider) but also how the results are
determined and displayed
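Ranking by keyword occurrence and proximity, as the early engines did, might look roughly like this. The scoring formula is an assumption made up for illustration (occurrence count plus a bonus for two terms sitting close together), not any engine's real rule:

```python
def score(text, terms):
    """Score = total keyword occurrences, plus a proximity bonus for a two-term query."""
    words = text.lower().split()
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    occurrences = sum(len(p) for p in positions.values())
    if len(terms) == 2 and all(positions[t] for t in terms):
        a, b = terms
        # Smallest distance, in words, between the two distinct terms.
        gap = min(abs(i - j) for i in positions[a] for j in positions[b])
        proximity_bonus = 1.0 / gap
    else:
        proximity_bonus = 0.0
    return occurrences + proximity_bonus

close = score("kyoto warming talks", ["kyoto", "warming"])
far = score("global warming summit in kyoto", ["warming", "kyoto"])
print(close, far)
```

A score like this is easy to inflate by repeating keywords, which is part of why it breaks down on unstructured Web documents.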
|
|
10
|
- Google changed everything with PageRank and link analysis
- Although each search engine’s algorithm is secret, we know that rankings
(display order) are different and that they change frequently
- Some factors in relevancy ranking include
- Location of keyword (meta tags, title)
- Frequency of keyword
- Relation of keywords
|
|
11
|
- Link Analysis
- By analyzing how pages link to each other, a search engine can both
determine what a page is about and whether that page is deemed to be
"important" and thus deserving of a ranking boost. Also
called popularity.
- Search engines have to build in protection against spamming (repeating
keywords) and artificial links
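Link analysis can be illustrated with the simplified, textbook form of PageRank. Google's production algorithm is secret, so treat this as a sketch of the published idea only; the link graph, the damping value of 0.85, and the iteration count are assumptions:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank evenly.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

Page "c" ranks highest because the most pages link to it, and "a" outranks "b" and "d" because its single inbound link comes from an important page.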
|
|
12
|
- Google has search criteria that consider the location of the searcher.
The search “westminster
college” produces differently ranked (displayed) results depending on
whether the searcher is in Utah, Missouri, Pennsylvania, London, or
Sydney, which all house colleges called Westminster.
- How is this done? We’re not sure,
but when Google applied for a patent for ProfileRank, a spin off of
PageRank, they included the following information:
- “…different users can be automatically recognized based on the items
they access or other characteristics of their access patterns [even if
they don’t log in or provide a profile].”
- “The relevance of a site to user can be accurately characterized by a
set of profile ranks, based on the correlation between a sites content
and the user’s term-based profile.”
|
|
13
|
- Language
- Age of document (could be countered by links)
- Document length (shorter = better)
- Anchor text of links
- Uniqueness
- Change: Along with measuring historical and current rates of change,
search engines watch for the amount of content that has changed in a
document over the course of updates and which specific pieces of content
on a web page have experienced change.
- Quality and Authority (spelling, grammar?)
|
|
14
|
- In Google, two methods can be used to highlight search terms which can
help a user determine relevance (and also satisfy a librarian’s
never-ending curiosity as to why that strange web page appeared in my
search results)
- Cache:www.ula.org conference
- Click the “Cached” link on a results list
- Conference will be highlighted on that page
|
|
15
|
- Default Search
- The default search in all search engines is “AND”. This means that the logical operator AND
is automatically inserted between keywords.
- greenhouse effect = greenhouse AND effect
- To combat this, use phrase searching
- Put phrases in double quotation marks
- Phrase searching keeps words together in that order
- “greenhouse effect” = greenhouse effect
- “greenhouse effect” kyoto = greenhouse effect AND kyoto
- “Phrase searching” is the essential tool in web searching!
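The difference between the default AND search and a phrase search can be shown in a few lines. A minimal sketch with hypothetical function names, matching on whole words only:

```python
def matches_and(text, query):
    """Implicit AND: every keyword appears somewhere in the text."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())

def matches_phrase(text, phrase):
    """Phrase search: the keywords appear together, in that order."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

doc = "the effect of a greenhouse on tomato yield"
print(matches_and(doc, "greenhouse effect"))     # both words present
print(matches_phrase(doc, "greenhouse effect"))  # but not as a phrase
```

The sample document satisfies greenhouse AND effect yet has nothing to do with the greenhouse effect, which is the problem phrase searching solves.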
|
|
16
|
- “Fuzzy and” refers to the display of results in search engines. It is
used by Google, but Yahoo! says it doesn’t use “fuzzy and”.
- Even though the default search places the Boolean operator AND between
keywords and documents with all terms (Boolean AND) are ranked first,
they can be followed by documents containing any terms (Boolean OR). The
farther down the results list, the fewer the terms, although at least
one should always be present.
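The “fuzzy and” ordering described above can be approximated by ranking documents on how many of the query terms they contain. This is a guess at the observable behavior, not Google's implementation; the documents are hypothetical:

```python
def fuzzy_and(docs, query):
    """Rank docs by how many distinct query terms they contain; drop docs with none."""
    terms = set(query.lower().split())
    scored = []
    for name, text in docs.items():
        hits = len(terms & set(text.lower().split()))
        if hits:
            scored.append((hits, name))
    # Most matching terms first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

DOCS = {
    "all": "greenhouse effect and kyoto",
    "two": "greenhouse effect explained",
    "one": "visiting kyoto",
    "none": "poker rules",
}
print(fuzzy_and(DOCS, "greenhouse effect kyoto"))
```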
|
|
17
|
- Google uses stemming technology. Thus, when it deems it appropriate,
Google will search not only for your search terms, but also for words
that are similar to some or all of those terms.
- If you search for dogs dietary needs, Google will also search for dogs diet
needs, dog dietary needs, dog diets need and other related variations of
your terms.
- Stemming is unpredictable
- Stemming does not work with single word searches or within “phrase
searches”
|
|
18
|
- There may be searches where stemming is counterproductive.
- Danny Sullivan of Search Engine Watch reports that placing the plus
( + ) symbol in front of a word stops Google from stemming that word.
- Example: dogs +dietary needs
- Dogs and needs will still be stemmed, dietary will not
- All symbols in search engines should be preceded by a space but followed
directly by the keyword
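Stemming, and the plus symbol that switches it off, can be sketched with a toy suffix stripper. The suffix list below is an assumption chosen only to fit the dogs dietary needs example; real stemmers use far more elaborate rules, and Google's are not public:

```python
SUFFIXES = ("ary", "s")  # toy list, chosen to fit the example above

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(term):
    """Stem a search term unless it is protected by a leading plus sign."""
    if term.startswith("+"):
        return term[1:].lower()
    return stem(term)

print([normalize(t) for t in "dogs dietary needs".split()])
print([normalize(t) for t in "dogs +dietary needs".split()])
```

With the plus sign, dietary is kept intact while dogs and needs are still reduced to their stems.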
|
|
19
|
- All the “big” search engines support the logical operator OR. OR means any one of these equivalent
terms should appear in results
- When searching for information on bilingual schools
- School = education = classroom
- bilingual school OR education OR classroom
- Bilingual = dual language
- Bilingual OR dual language school OR education OR classroom
- Bilingual OR dual language AND school OR education OR classroom
- In Google, the logical operator OR can also be replaced by the tilde ( ~ ),
which tells Google to do a synonym search.
- Bilingual ~classroom = bilingual education
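The last bilingual schools search above combines groups of equivalent terms: one term from each OR group must appear. A minimal sketch, assuming simple substring matching (so "schools" also matches "school"); the function name and sample titles are hypothetical:

```python
def matches(text, groups):
    """True when the text contains at least one term from every OR-group."""
    text = text.lower()
    return all(any(term in text for term in group) for group in groups)

# (bilingual OR "dual language") AND (school OR education OR classroom)
QUERY = [
    ("bilingual", "dual language"),
    ("school", "education", "classroom"),
]

print(matches("Dual language classroom strategies", QUERY))
print(matches("Bilingual dictionaries", QUERY))
```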
|
|
20
|
- Google performs the function of the logical operator NOT by using the
minus ( - ) symbol.
- Yahoo! supports all logical operators (AND, OR, NOT, AND NOT) if
capitalized
- As with Google and OR, logical operators must be capitalized in Yahoo!
- Yahoo! also supports the minus ( - ) sign for NOT and the plus ( + )
sign for AND.
|
|
21
|
- Stop words are terms the database (or in this case, search engine)
ignores in a search
- Stop words commonly include articles (a, an, the); prepositions (of, at,
in, on, to, etc.); interrogatives (who, what, where, when, etc.); and pronouns
(it, he, she, we, they, etc.)
- Stop words are usually not stored in the search engine’s index. They are replaced by a marker ( * ).
|
|
22
|
- Some search engines hint that they search a stop word if it is included in
a “phrase search.” Google does not. Uncapitalized and unrecognized
logical operators are considered stop words.
- There are many times when searching a stop word as part of a search can
be important (book titles, quotes, phrases, etc.).
- To force the search of a stop word, place the plus
( + ) sign in front of it.
- “+of mice +and men”
- “+what +not +to wear”
- “failure +is +not +an option”
|
|
23
|
- Notice spacing – it sometimes matters
- define: Definitions. Example: define:cookie
- related: Similar sites. Example: related:www.dixie.edu
- link: Sites that link to this site. Example: link:www.ula.org
- stocks: Search ticker symbol(s). Example: stocks: zion wfc
- site: Searches word within site. Example: library site:www.dixie.edu
- inurl: Searches word in URL only. Example: inurl:birkenstock
- allinurl: Multiple words. Example: allinurl: digital reference
- intitle: Searches word in title only. Example: intitle:dixie
- allintitle: Multiple words. Example: allintitle: val browning
- calculator
|
|
24
|
- Area code: area code miami
- ZIP code: zip code salt lake city
- Synonym (#4): synonym taboo
- Facts: spider facts
- Exchange rates: convert euro dollar
- Movie showtimes: showtimes provo
- News: news hurricane katrina
- Patent: patent 2612994
- Quote: quote wfc
- Scores: scores yankees
- Business name & Location: pizza hut st. george
- Basic mathematical calculations. You can add (+), subtract (-), multiply
(*) and divide (/). 17 * 4 + 856 / 2 = 496
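The calculator example above depends on the usual order of operations: multiplication and division are done before addition. Python follows the same convention, so the arithmetic can be checked directly:

```python
# Multiplication and division bind tighter than addition:
# 17 * 4 = 68 and 856 / 2 = 428, then 68 + 428.
result = 17 * 4 + 856 / 2
print(result)

# Reading strictly left to right would give a different answer.
left_to_right = ((17 * 4) + 856) / 2
print(left_to_right)
```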
|
|
25
|
- Domain
- Word within specific site
- By domain extension: .edu .gov .mil .com .net .org
- Only within: jeter site:.org
- Not within: jeter -site:.com
- Keyword occurrence
- Date updated (only in advanced search)
- 3 months, 6 months, past year
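The domain limits above amount to keeping or discarding results by the end of the host name. A minimal sketch using Python's standard `urllib.parse`; the `site_filter` function and the example.org and example.com URLs are hypothetical:

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the host name of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""

def site_filter(urls, suffix, exclude=False):
    """Keep URLs whose host ends with suffix (site:), or drop them (-site:)."""
    suffix = suffix.lower()
    kept = []
    for url in urls:
        inside = host_of(url).endswith(suffix)
        if inside != exclude:
            kept.append(url)
    return kept

URLS = [
    "http://www.example.org/jeter",
    "http://shop.example.com/jeter",
    "http://www.dixie.edu/library",
]
print(site_filter(URLS, ".org"))                # like jeter site:.org
print(site_filter(URLS, ".com", exclude=True))  # like jeter -site:.com
```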
|
|
26
|
- Language (only in advanced search)
- File format
- Limit to
- Limit to not
- .pdf .ps .doc .xls .ppt .rtf
- refworks filetype:ppt
|
|
27
|
- Search poker
- Results = dozens of sites to play poker
- Search poker site:.edu
- Results much more varied
- Statistics of poker
- Probabilities of poker
- Rules of poker
- History of the Poker World Series
|
|
28
|
- Use the asterisk to stand in for a word within a search in quotes or
without quotes
- Search: the * of the thing
- the principle of the thing
- the meaning of the thing
- the spirit of the thing
- the essence of the thing
- the gist of the thing
- the nature of the thing
|
|
29
|
- Use the asterisk to stand in for intervening terms (placeholder)
- “global warming * kyoto” (within one word)
- Global Warming Summit in Kyoto
- “kyoto * global warming” (within one word)
- Kyoto Protocol Global Warming Treaty
- yankees * * home runs (within two words)
- Yankees hit 223 home runs
- Yankees hit the winning home run
- librarians * * * patriot act (within three words)
- Librarians on the Patriot Act (0)
- Librarian Fights Patriot Act (1)
- Libraries Chafe under USA Patriot Act (2)
- Librarians’ Assembly Considers USA Patriot Act (3)
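The placeholder behavior above (each asterisk allows one more intervening word) can be imitated with a regular expression. This sketch assumes a run of N asterisks means "up to N words" and matches exact words only, so it does not reproduce Google's stemming or stop word handling:

```python
import re

def wildcard_to_regex(pattern):
    """Turn a run of N asterisks into 'up to N intervening words'."""
    tokens = pattern.lower().split()
    parts = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "*":
            run = 0
            while i < len(tokens) and tokens[i] == "*":
                run += 1
                i += 1
            parts.append(r"(?:\s+\S+){0,%d}" % run)
        else:
            # Literal word, separated from whatever came before by whitespace.
            parts.append((r"\s+" if parts else "") + re.escape(tokens[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)

rx = wildcard_to_regex("yankees * * home runs")
print(bool(rx.search("Yankees hit 223 home runs")))
print(bool(rx.search("Yankees have never hit many home runs")))
```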
|
|
30
|
- ISBNs = enter number 09450612X
- Patents = patent 2135194
- UPC Codes = enter number 073333531084
- VINs = enter number AAAAA999A9AA99999
- FedEx, UPS, USPS tracking numbers
= enter number
- Date range
- Example: 1903..1914 advertising
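The date range syntax above matches any number between the two end points. A rough sketch of that test, assuming the numbers appear as plain digits in the text (the function name and sample strings are hypothetical):

```python
import re

def in_range(text, spec):
    """True if any number in the text falls inside a range such as 1903..1914."""
    low, high = (int(n) for n in spec.split(".."))
    return any(low <= int(n) <= high for n in re.findall(r"\d+", text))

print(in_range("advertising posters from 1910", "1903..1914"))
print(in_range("advertising in 1925", "1903..1914"))
```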
|
|
31
|
- Content
- Size
- Comprehensiveness
- % of available web
- Currency
- rate of indexing
- depth of re-indexing
- Completeness
- percent of documents included
- Types of documents
- especially full-text searching of non-html
- Speed of searches
- Consistency of results
|
|
32
|
- Capabilities
- Basic search options
- Default search
- Phrase searching
- Equivalent terms
- Advanced search options
- Field searching
- Limit by domain
- Single domain or more than one?
- Limit by type of document
- Limit by language
- Limit by date updated
|
|
33
|
- Capabilities (continued)
- Forcing stop words
- Limit on number of search terms
- previously 10 in Google, now 32
- Complexity limits
- Search within previous results
- Customize results or customize search
- Filters
- Personalization (preferences)
|
|
34
|
- Results Display
- Ranking
- Popularity? Relevancy?
- Default to adjacency?
- “Fuzzy and”?
- Display
- Highlighted keywords
- Excerpt display
|
|
35
|
- Google is bigger, much bigger, than any other search engine
- Search library in MSN = 98,841,330
- Search library in Ask.com = 137,410,000
- Search library in Yahoo! = 686,000,000
- Search library in Google = 3,150,000,000
- Still, the results represent only a portion of the World Wide Web, which
is only a portion of the Internet. Current estimates are that Google
indexes 50% of the visible web.
|
|
36
|
- PageRank is unique in Google
- Popularity
- Importance (traffic, quality of links)
- Word proximity
- Word frequency and position
- Limits
- Fuzzy searches (synonyms, stemming)
- Shortcuts
- Unique databases
|
|
37
|
- Number of stop words and clumsiness in forcing stop words
- Yahoo! has full Boolean logic with nesting!
- “Fuzzy and” (changing to OR within search results)
- “Importance” can override other relevancy factors
|
|
38
|
- Why should you use multiple search engines? Didn’t we just decide Google
is the best?
- Jux2: http://www.jux2.com/
- There is little overlap in search engine results
- Only 3.1 – 3.5 of the top 10 results are the same when comparing 500
popular searches in Google, Yahoo!, and MSN.
|
|
39
|
- According to Jux2, “In a separate test of 91 random searches, we found
that Google and Yahoo! share only 23% of their top 100 results.
Furthermore, only 4.8 of Google's top 10 results even made Yahoo!'s top
100 (conversely, only 5.4 of Yahoo!'s top 10 made Google's top 100).”
- Yikes!
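An overlap figure like the ones Jux2 reports can be computed by comparing the top results of two ranked lists. The result lists below are made-up placeholders, not real search output:

```python
def overlap(results_a, results_b, top_n=10):
    """Count URLs that appear in the top_n of both ranked result lists."""
    return len(set(results_a[:top_n]) & set(results_b[:top_n]))

# Hypothetical ranked results from two engines for the same search.
engine_one = ["a", "b", "c", "d", "e"]
engine_two = ["c", "x", "a", "y", "z"]

shared = overlap(engine_one, engine_two, top_n=5)
print(shared, "of the top 5 results are shared")
```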
|
|
40
|
- Disparity is due to
- Different databases of web pages
- Different ranking systems
- Different indexing methods and rates
- Think of search engines as stores with
- Different merchandise
- Different brands
- Different service
- Different prices
- Shop around!
|
|
41
|
- Metasearch engines are search engines that query other search engines
and then combine all the results received. In effect, the user is
not using just one search engine, but a combination of many search
engines at once, supposedly to optimize searching.
- However, you lose all the functionality we’ve just learned!!!
- Plus, the metasearch engine ranks and displays results using its own
algorithm, which probably isn’t as good as the individual search
engines’ systems.
- You don’t really need more results, just better results.
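To see why a metasearch engine's merged ranking differs from each source's own, here is one simple way the lists could be combined: every result earns 1/rank points from each engine that returns it. This scheme is an assumption for illustration; actual metasearch engines do not publish their merging rules.

```python
from collections import defaultdict

def merge(result_lists):
    """Combine ranked lists: each URL scores 1/rank per list, highest total first."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical top results from two engines.
engine_one = ["x", "y", "z"]
engine_two = ["y", "w", "x"]
print(merge([engine_one, engine_two]))
```

The first engine's top result ("x") drops to second place in the merged list, so the careful ranking each engine did on its own is partly lost.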
|
|
42
|
- The Invisible Web (also called the deep or hidden web) is the name given
to that portion of the World Wide Web that is not indexed by search
engines.
- Current estimates are that Google, by far the largest search engine,
only accesses 50% of the visible web
- It consists of pages which are not linked to by other pages, such as
dynamic Web pages based on responses to database queries.
- It also includes sites that require registration or otherwise limit
access to their pages.
|
|
43
|
- Directories: spider-free, built by humans.
- Keyword search or browse
- Search engine subject directories
- Google Directory (4 million pages)
- http://www.google.com/dirhp
- Yahoo! Directory
- http://dir.yahoo.com/ (2 million pages)
- LII: Librarians’ Index to the Internet (excellent annotations)
- Best sites only compiled by public librarians
|
|
44
|
- INFOMINE (excellent annotations)
- Great sites from academic librarians
- AcademicInfo (browsing recommended over keyword search)
- http://www.academicinfo.net/
- IPL: Internet Public Library
- Digital Librarian
- http://www.digital-librarian.com/
|
|
45
|
- Proprietary databases
- Pioneer, Utah’s Online Library
- Individual resources purchased by academic, public, and school
libraries
- Large databases
- Firstgov http://www.firstgov.gov/
- American Memory http://memory.loc.gov/ammem/index.html
- Specialized databases
- Find in directories
- Access through library research guides
- Locate in search engines (search chemicals database)
|
|
46
|
- Mountain West Digital Library
- http://www.lib.utah.edu/digital/mwdl/
- Utah Digital Newspapers
- http://www.lib.utah.edu/digital/unews/
- Utah.gov
- http://www.utah.gov/
- Utah Collections Multimedia Encyclopedia
|
|
47
|
- Google Images
- http://images.google.com/imghp?hl=en
- Google Scholar
- http://scholar.google.com/
- Froogle
- http://froogle.google.com/frghp?hl=en&tab=wf&q=
- Google Earth
- Google Blog Search
- http://blogsearch.google.com/
- Google News
- http://news.google.com/nwshp?hl=en&tab=fn&ned=us&q=
|
|
48
|
- Search engine technology is one of the fastest to change in an arena of
incredibly fast evolution. To stay current:
- Learn the search techniques
- Keep yourself informed
- Try different search engines
- Ask others their “tricks”
- Experiment
- Subscribe to the email “Google Newsletter for Librarians”
- http://www.google.com/librariancenter/index.html
|
|
49
|
- UC Berkeley Library http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
- Search Engine Watch http://searchenginewatch.com/
- Jux2: Compare search engines
- Search Engines
- Advanced Search pages
- Search Tips / Techniques
|
|
50
|
- The “Googling to the Max” handouts are from the UC Berkeley Teaching
Library, an excellent source of information about searching the Internet
- http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Google.html
- This presentation will be posted on the DSC Browning Library Web pages
under New @ Your Library
|
|
51
|
- Martha Talman
Instruction Librarian
Val A. Browning Library
Dixie State College of Utah
St. George UT 84770
435.652.7722
talman@dixie.edu
|