Information & Instructions for Web Site Creators
Overview
If you plan to use the Google search appliance on your Stanford web site, please subscribe to search-partners@lists.stanford.edu for notifications of service changes, updates, etc.
How to...
Get pages into the index
Google's indexing robot starts at the top of the Stanford web site and follows links to find all indexable pages within the Stanford search collection. All you need to do to get your web pages into the Stanford/Google index is:
- put the pages up in a web space that is not excluded from the Stanford search collection
- make sure your pages don't contain meta tags that prevent the robot from indexing your page
- make sure your page can be reached by clicking links from one of the top-level pages in the Stanford search collection
There is no need to submit pages to the index: the Google crawler will pick up changed, new, and removed pages automatically.
If a page is not in the index, search for all pages that link to it. The syntax for this search is "link:yourdomain.com". For example, to see whether any pages link to your personal page at Stanford, enter "link:http://www.stanford.edu/~mypage" into the google.stanford.edu search box. The results list all pages that link to your page; if there are none, the crawler has no path by which to reach it.
If you would like your page(s) to be listed in the Stanford index, visit http://www.stanford.edu/home/atoz/index.html and click on the "suggestions" link.
Note that if no other page in the Stanford search collection links to your web pages, the Google crawler won't pick them up. If you need to keep your web site unlinked from other pages but still want it to appear in the search index, enter a HelpSU request, and the Search Administrator will help you find a solution.
Keep pages out of the index
If you don't want a page to be indexed, insert this <meta> tag within your page's <head> tag:
<head>
<meta name="robots" content="noindex, nofollow">
</head>
This will prevent crawlers (robots) from indexing the page, and from following any links from the page. If the page has already been indexed, it will be removed from the index the next time Google crawls the page.
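The two directives in the tag above can also be given separately. As a minimal sketch, assuming the crawler honors each directive on its own as in the standard robots meta convention (this page documents only the combined form), a page that should stay out of the index while still letting the crawler follow its links might use:

```html
<head>
<!-- Assumption: "noindex" and "follow" are honored individually -->
<!-- noindex keeps this page out of the index; follow lets the crawler use its links -->
<meta name="robots" content="noindex, follow">
</head>
```

This can be useful for a navigation-only page whose linked pages should still be found by the crawler.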
You can prevent the pages in a directory from being indexed by restricting access to the directory with webauth. (Note that we may implement authenticated indexing in Google, which may require your .htaccess file to be more specific than "require valid-user" if you want to keep the protected pages out of the Stanford index. We'll announce when, and if, this service change takes place, and how to adjust your security accordingly.)
If you need to get a page out of the index urgently, enter a HelpSU request. Provide the URL of the page you want removed, and indicate whether you want the page suppressed temporarily or permanently.
Put a search box on your site
You can add a search box to your web site to help visitors find content on your site. You can restrict the search to a specified directory, or search the entire Stanford web site.
The visitor enters a search term and clicks the Search button; the browser then leaves your site and displays the search results in a formatted page.
Insert the following HTML into your web page where you want the search box to appear:
<form method="get" action="http://ask.stanford.edu/search">
<input type="text" name="q" size="32" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="stanford" />
<input type="hidden" name="client" value="stanford" />
<input type="hidden" name="proxystylesheet" value="stanford" />
<input type="hidden" name="output" value="xml_no_dtd" />
<input type="hidden" name="as_dt" value="i" />
<input type="hidden" name="as_sitesearch" value="<yoururl>" />
</form>
Customize the following required parameters:
- name="q" size="32" maxlength="255"
Sets the width (in number of characters) of the search box. You can change the size to suit your site's layout, but don't change the maxlength value.
- name="btnG" value="Search"
The text that appears on the search button.
e.g.: value="Search Medical School Site"
- name="client" value="stanford"
name="proxystylesheet" value="stanford"
or:
name="client" value="stanford_basic"
name="proxystylesheet" value="stanford_basic"
The client parameter specifies the Google Front End. The proxystylesheet parameter specifies the XSLT Stylesheet. There are two default stylesheets available for the display of search results. The default value="stanford" includes a menu of defined Collections. If you prefer to hide the display of Collections, you can use value="stanford_basic". To specify a custom Front End, specify the parameter value with the name of the Front End.
e.g.: value="stanfordit"
Note: In most cases, the client value should be identical to the proxystylesheet value. The only time you would want different values for the two parameters is when you want to retain the Front End's KeyMatch, Synonyms, Filters, and Remove URL settings, but change to a different Output Format.
e.g.: name="client" value="stanford", name="proxystylesheet" value="google"
- name="site" value="stanford"
Specifies the Google Collection. The default value="stanford" searches the entire Stanford Google Collection. To search a particular Collection, specify the parameter value with the name of the Collection.
e.g.: value="stanfordit"
<input type="hidden" name="site" value="<Collection Name>" />
- name="output" value="xml_no_dtd"
Specifies the output format. There is no need to modify this value.
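Putting the required parameters together, the following is a sketch of a search box that searches the entire Stanford collection and displays results without the Collections menu. The size of 20 and the button label are illustrative choices; every other value comes from the descriptions above.

```html
<!-- Sketch: whole-collection search using the stanford_basic Front End and stylesheet -->
<form method="get" action="http://ask.stanford.edu/search">
<!-- size is adjustable; maxlength stays at 255 -->
<input type="text" name="q" size="20" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search Medical School Site" />
<input type="hidden" name="site" value="stanford" />
<!-- client and proxystylesheet are kept identical, as recommended -->
<input type="hidden" name="client" value="stanford_basic" />
<input type="hidden" name="proxystylesheet" value="stanford_basic" />
<input type="hidden" name="output" value="xml_no_dtd" />
</form>
```

Because the as_dt and as_sitesearch fields are omitted, the search is not limited to any directory.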
Customize the following optional parameters:
If you want to restrict your search feature to a specific directory (and its subdirectories), include the following two parameters (as_dt and as_sitesearch).
If you want the search feature on your site to search the entire Stanford collection, remove these two parameters from your HTML.
name="as_dt" value="i"
This setting determines whether your search should include or exclude the directory specified in "as_sitesearch". Values can be:
- "i" (include only results in the web directory specified by as_sitesearch)
- "e" (exclude all results in the web directory specified by as_sitesearch)
name="as_sitesearch" value="<yoururl>"
Pages in the specified directory will be included in or excluded from your search (according to the value of "as_dt").
e.g.: name="as_sitesearch" value="www.stanford.edu/dept/drama"
- You must specify the complete canonical name of the host server followed by the path of the directory.
e.g.: www.stanford.edu/services, not www/services
- If the slash ("/") character is at the end of the web directory path specified, then only files within that directory will be searched, and files in sub-directories will not be considered.
e.g.:
www.stanford.edu/services to include sub-directories
www.stanford.edu/services/ to exclude sub-directories
- as_sitesearch allows you to specify one directory (and all its sub-directories) as the domain to be searched; you cannot specify multiple disparate directories using this option. (See additional parameters for options that allow you to specify multiple URLs, and information about requesting a subcollection.)
- If you want the search feature on your site to search the entire Stanford web site, delete this parameter along with as_dt.
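As an illustration of the two optional parameters working together, the sketch below restricts results to the www.stanford.edu/dept/drama directory used as the example above, including its sub-directories. All other fields are unchanged from the basic form.

```html
<!-- Sketch: restrict results to www.stanford.edu/dept/drama and its sub-directories -->
<form method="get" action="http://ask.stanford.edu/search">
<input type="text" name="q" size="32" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="stanford" />
<input type="hidden" name="client" value="stanford" />
<input type="hidden" name="proxystylesheet" value="stanford" />
<input type="hidden" name="output" value="xml_no_dtd" />
<!-- "i" includes only the directory below; "e" would exclude it instead -->
<input type="hidden" name="as_dt" value="i" />
<!-- No trailing slash, so sub-directories are searched too -->
<input type="hidden" name="as_sitesearch" value="www.stanford.edu/dept/drama" />
</form>
```

Changing the as_dt value to "e" would turn the same form into a search of everything except that directory.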
name="sitesearch"
The sitesearch parameter limits search results to documents in the specified domain, host, or web directory. It has no effect if the q parameter is empty. This parameter has the same effect as the "site:" special query term. See instructions on using this parameter.
Notes:
- Unlike as_sitesearch, the sitesearch parameter is not affected by the as_dt parameter.
- The sitesearch and as_sitesearch parameters are handled differently in the XML results. The sitesearch parameter's value is not appended to the search query in the results, so the original query term is not modified when you use the sitesearch parameter.
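A minimal sketch of a form that uses sitesearch in place of the as_dt and as_sitesearch pair follows. It assumes the appliance accepts sitesearch as a hidden form field in the same way as the other parameters; the directory value is only an example.

```html
<!-- Sketch: limit results with sitesearch; as_dt is omitted because it does not affect sitesearch -->
<form method="get" action="http://ask.stanford.edu/search">
<input type="text" name="q" size="32" maxlength="255" value="" />
<input type="submit" name="btnG" value="Search" />
<input type="hidden" name="site" value="stanford" />
<input type="hidden" name="client" value="stanford" />
<input type="hidden" name="proxystylesheet" value="stanford" />
<input type="hidden" name="output" value="xml_no_dtd" />
<!-- Assumption: example directory; replace with your own host and path -->
<input type="hidden" name="sitesearch" value="www.stanford.edu/dept/drama" />
</form>
```

With this form, the query shown on the results page should remain exactly what the visitor typed, since the sitesearch value is not appended to it.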
Stanford's Configuration of the Google Search Appliance
Stanford has customized the Google Search Appliance for our environment. The following changes have been made to the default Google Search Appliance configuration.
No caching
The commercial Google search engine caches a copy of each page that was indexed by the crawler. If the contents of a page have been changed since the index was last updated, the user can view the cached version of the page (that is, the page as it existed when it was indexed).
For security and privacy reasons, the Stanford index does not use the caching feature. (Note, however, that Google's University Search of Stanford University does cache pages.)
Search collection
Stanford's search collection includes all the web pages in these domains:
- http://www.stanford.edu
- http://*.stanford.edu (including most virtual URLs such as medicine.stanford.edu)
- http://www.stanfordalumni.org
- http://www.stanfordmag.org
- http://gostanford.com
...that are not specifically excluded by:
- the search administrator
- a noindex <meta> tag in the page's HTML
- password (including webauth) protection
- restricted-access files and/or directories
- dynamically-generated content
Web pages excluded by the search administrator
Web pages in the following directories (and their subdirectories) are excluded from the Stanford search collection:
- Personal web pages
http://www.stanford.edu/~*
http://www.stanford.edu/people
- Highwire Press
http://highwire.stanford.edu
- The Stanford Picture Tour
http://tour.stanford.edu
- the SLAC web site
http://*.slac.stanford.edu
- URLs being phased out of use
e.g.: http://www-leland.stanford.edu
- webauth-protected (or otherwise restricted-access) pages and directories
- specific pages kept out of the index at the request of their owners
These pages have been excluded for a variety of system performance, copyright, license, and University policy reasons. Some or all of these pages, however, may be indexed by Google's own Stanford University search, which indexes our site approximately once a month.
Additional directories or pages not listed here may have been excluded by the search administrator. If you think your page may have been excluded and don't want it to be, enter a HelpSU request.
Crawling schedule
The search appliance is configured to crawl the entire Stanford web site continuously. If your new page must be included in search results immediately, or if you have questions about the indexing of your content, enter a HelpSU request.


